



AN INTRODUCTION 
TO INFORMATION THEORY 


Fazlollah M. Reza 

Professor of Electrical Engineering 
Syracuse University 



TATA McGRAW-HILL PUBLISHING CO. LTD. 
Bombay-New Delhi 




AN INTRODUCTION TO INFORMATION THEORY

1961 







PREFACE 


Statistical theory of communication is a broad new field comprised of 
methods for the study of the statistical problems encountered in all types 
of communications. The field embodies many topics such as radar 
detection, sources of physical noise in linear and nonlinear systems, 
filtering and prediction, information theory, coding, and decision theory. 
The theory of probability provides the principal tools for the study of 
problems in this field. 

Information theory as outlined in the present work is a part of this 
broader body of knowledge. This theory, originated by C. E. Shannon, 
introduced several important new concepts and, although a part of 
applied communications sciences, has acquired the unique distinction of 
opening a new path of research in pure mathematics. 

The communication of information is generally of a statistical nature, 
and a current theme of information theory is the study of simple ideal 
statistical communication models. The first objective of information 
theory is to define different types of sources and channels and to devise 
statistical parameters describing their individual and ensemble operations. 

The concept of Shannon’s communication entropy of a source and the 
transinformation of a channel provide most useful means for studying
simple communication models. In this respect it appears that the con- 
cept of communication entropy is a type of describing function that is 
most appropriate for the statistical models of communications. This is 
similar in principle to the way that an impedance function describes a 
linear network, or a moment indicates certain properties of a random 
variable. The introduction of the concepts of communication entropy, 
transinformation, and channel capacity is a basic contribution of informa- 
tion theory, and these concepts are of such fundamental significance that 
they may parallel in importance the concepts of power, impedance, and 
moment. 

Perhaps the most important theoretical result of information theory is 
Shannon’s fundamental theorem, which implies that it is possible to 
communicate information at an ideal rate with utmost reliability in the 
presence of “noise.” This succinct but deep statement and its consequences
unfold the limitation and complexity of present and future
methods of communications. Mastery of the proof offers many fringe
benefits to those interested in the analysis and synthesis of communication 
networks. For this reason we have included several methods of proof 
of this theorem (see Chapters 4 and 12). However, the impatient reader 
is forewarned that the proof entails much preparation which may prove 
to be burdensome. 

This book originated about five years ago from the author's lecture 
notes on information theory. In presenting the subject to engineers, 
the need for preliminary lectures on probability theory was observed. 
A course in probability, even now, is not included in the curriculum of a 
majority of engineering schools. This fact motivated the inclusion of an 
introductory treatment of probability for those who wish to pursue the 
general study of statistical theory of communications. 

The present book, directed toward an engineering audience, has a 
threefold purpose:

1. To present elements of modern probability theory (discrete, con- 
tinuous, and stochastic) 

2. To present elements of information theory with emphasis on its 
basic roots in probability theory 

3. To present elements of coding theory

Thus this book is offered as an introduction to probability, information,
and coding theory. It also provides an adequate treatment of probability
theory for those who wish to pursue topics other than information
theory in the field of statistical theory of communications. 

One feature of the book is that it requires no formal prerequisites
except the usual undergraduate mathematics included in an engineering
or science program. Naturally, a willingness to consult other references 
or authorities, as necessary, is presumed. The subject is presented in 
the light of applied mathematics. The immediate involvement in tech- 
nological specialities that may solve specific problems at the expense of 
a less thorough basic understanding of the theory is thereby avoided. 

A most important, though indirect, application of information theory 
has been the development of codes for transmission and detection of 
information. Coding literature has grown very rapidly since it, presum- 
ably, applies to the growing field of data processing. Chapters 4 and 
13 present an introduction to coding theory without recourse to the use 
of codes. 

The book has been divided into four parts: (1) memoryless discrete
schemes, (2) memoryless continuum, (3) schemes with memory, and (4)
an outline of some of the recent developments. The appendix contains 
some notes which may help to familiarize the reader with some of the 
literature in the field. The inclusion of many reference tables and a 
bibliography with some 200 entries may also prove to be useful. 





The emphasis throughout the book is on such basic concepts as sets, 
the probability measure associated with sets, sample space, random vari- 
ables, information measure, and capacity. These concepts proceed from 
set theory to probability theory and then to information and coding 
theories. The application of the theory to such subjects as radar 
detection, optics, and linguistics was not undertaken. We make no 
pretension for “usefulness” and immediate application of information
theory. From an educational standpoint, it appears, however, that the 
topics discussed should provide a suitable training ground for communi- 
cation scientists. 

The most rewarding aspect of this undertaking has been the pleasure 
of learning about a new and fascinating frontier in communications. By 
working on this book, I came to appreciate fully many subtle points and 
ingenious procedures set forth in the papers of the original contributors 
to the literature. I trust this attempt to integrate these many contribu- 
tions will prove of value. Despite pains taken by the author, inac- 
curacies, original or inherited, may be found. Nevertheless, I hope the 
reader will find this work an existence proof of Shannon’s fundamental
theorem: that “information” can be transmitted with a high degree of
reliability at a rate close to the channel capacity despite all forms of 
“noise.” 

At any rate, there is an eternal separation between what one strives for 
and what one actually achieves. As Leon von Montenaeken wrote:

La vie est brève,

Un peu d'espoir,

Un peu de rêve,

Et puis — bonsoir. 


Fazlollah M. Reza




ACKNOWLEDGMENTS 


The author wishes to acknowledge indebtedness to all those who have 
directly or indirectly contributed to this book. Special tribute is due to 
Dr. C. E. Shannon who primarily initiated information theory. 

Dr. P. Elias of the Massachusetts Institute of Technology has been
generous in undertaking a comprehensive reading and reviewing of the 
manuscript. His comments, helpful criticism, and stimulating discussions 
have been of invaluable assistance. 

Dr. L. A. Cote of Purdue University has been very kind to read and 
criticize the manuscript with special emphasis upon the material on 
probability theory. His knowledge of technical Russian literature and 
his unlimited patience in reading the manuscript in its various stages of 
development have provided a depth that otherwise would not have been 
attained. 

Thanks are due to Dr. E. N. Gilbert of Bell Telephone Laboratories
and Dr. J. P. Costas of General Electric Company for helpful comments 
on the material on coding theory; to Prof. W. W. Harman of Stanford 
University for reviewing an early draft of the manuscript; to Mr. L. 
Zafiriu of Syracuse University who accepted the arduous task of proof- 
reading and who provided many suggestions. 

In addition numerous scientists have generously provided reprints, 
reports, or drafts of unpublished manuscripts. The more recent mate- 
rial on information theory and coding has been adapted from these cur- 
rent sources but integrated in our terminology and frame of reference. 
An apology is tendered for any omission or failure to reflect fully the 
thoughts of the original contributors. 

During the past four years, I had the opportunity to teach and lecture 
in this field at Syracuse University, International Business Machines 
Corp., General Electric Co., and the Rome Air Development Center. 
The keen interest, stimulating discussions, and friendships of the scien- 
tists of these centers have been most rewarding. 

Special acknowledgment is due the United States Air Force Rome 
Air Development Center and the Cambridge Research Center for sup- 
porting several related research projects. 



I am indebted to my colleagues in the Department of Electrical Engi- 
neering at Syracuse University for many helpful discussions and to Mrs. 
H. F. Laidlaw and Miss M. J. Phillips for their patient typing of the 
manuscript. 

I am particularly grateful to my wife and family for the patience which 
they have shown. 



CONTENTS 


PREFACE ix 

CHAPTER I Introduction 

1-1. Communication Processes 1

1-2. A Model for a Communication System 3

1-3. A Quantitative Measure of Information 5

1-4. A Binary Unit of Information 7

1-5. Sketch of the Plan 9

1-6. Main Contributors to Information Theory 11

1-7. An Outline of Information Theory 14

Part 1 : Discrete Schemes without Memory 

CHAPTER 2 Basic Concepts of Probability

2-1. Intuitive Background 19

2-2. Sets 21 

2-3. Operations on Sets 23 

2-4. Algebra of Sets 24 

2-5. Functions 30 

2-6. Sample Space 34 

2-7. Probability Measure 36 

2-8. Frequency of Events 38 

2-9. Theorem of Addition 40 

2-10. Conditional Probability 42 

2-11. Theorem of Multiplication 44 

2-12. Bayes’s Theorem 46 

2-13. Combinatorial Problems in Probability 49 

2-14. Trees and State Diagrams 52 

2-15. Random Variables 58 

2-16. Discrete Probability Functions and Distribution . . 59 

2-17. Bivariate Discrete Distributions 61 

2-18. Binomial Distribution 63 

2-19. Poisson’s Distribution 65 

2-20. Expected Value of a Random Variable 67 



CHAPTER 3 Basic Concepts of Information Theory: 

Memoryless Finite Schemes 

3-1. A Measure of Uncertainty 76

3-2. An Intuitive Justification 78 

3-3. Formal Requirements for the Average Uncertainty. 80 

3-4. H Function as a Measure of Uncertainty .... 82 

3-5. An Alternative Proof That the Entropy Function 

Possesses a Maximum 86 

3-6. Sources and Binary Sources 89 

3-7. Measure of Information for Two-dimensional Discrete 

Finite Probability Schemes 91 

3-8. Conditional Entropies 94 

3-9. A Sketch of a Communication Network 96 

3-10. Derivation of the Noise Characteristics of a Channel . 99 

3-11. Some Basic Relationships among Different Entropies . 101 

3-12. A Measure of Mutual Information .... 104 

3-13. Set-theory Interpretation of Shannon’s Fundamental 

Inequalities 106 

3-14. Redundancy, Efficiency, and Channel Capacity 108 

3-15. Capacity of Channels with Symmetric Noise Structures 111 

3-16. BSC and BEC 114

3-17. Capacity of Binary Channels 115 

3-18. Binary Pulse Width Communication Channel . 122 

3-19. Uniqueness of the Entropy Function 124

CHAPTER 4 Elements of Encoding

4-1. The Purpose of Encoding 131

4-2. Separable Binary Codes 137 

4-3. Shannon-Fano Encoding 138 

4-4. Necessary and Sufficient Conditions for Noiseless 

Coding 142 

4-5. A Theorem on Decodability 147 

4-6. Average Length of Encoded Messages 148 

4-7. Shannon’s Binary Encoding 151 

4-8. Fundamental Theorem of Discrete Noiseless Coding . 154 

4-9. Huffman’s Minimum-redundancy Code 155 

4-10. Gilbert-Moore Encoding 158 

4-11. Fundamental Theorem of Discrete Encoding in Presence 

of Noise 160 

4-12. Error-detecting and Error-correcting Codes .... 166 

4-13. Geometry of the Binary Code Space 168 

4-14. Hamming’s Single-error Correcting Code .... 171 




4-15. Elias’s Iteration Technique 176 

4-16. A Mathematical Proof of the Fundamental Theorem of 

Information Theory for Discrete BSC 180 

4-17. Encoding the English Alphabet 183

Part 2 : Continuum without Memory

CHAPTER 5 Continuous Probability Distribution and Density

5-1. Continuous Sample Space 191

5-2. Probability Distribution Functions 192 

5-3. Probability Density Function 194 

5-4. Normal Distribution 196 

5-5. Cauchy’s Distribution 198 

5-6. Exponential Distribution 199 

5-7. Multidimensional Random Variables 200 

5-8. Joint Distribution of Two Variables: Marginal 

Distribution 202 

5-9. Conditional Probability Distribution and Density . 204 

5-10. Bivariate Normal Distribution 206 

5-11. Functions of Random Variables 208 

5-12. Transformation from Cartesian to Polar Coordinate

System 214

CHAPTER 6 Statistical Averages

6-1. Expected Values; Discrete Case 220

6-2. Expectation of Sums and Products of a Finite Number 

of Independent Discrete Random Variables . . 222 

6-3. Moments of a Univariate Random Variable. 224 

6-4. Two Inequalities 227 

6-5. Moments of Bivariate Random Variables .... 229 

6-6. Correlation Coefficient 230 

6-7. Linear Combination of Random Variables .... 232 

6-8. Moments of Some Common Distribution Functions 234 

6-9. Characteristic Function of a Random Variable . 238 

6-10. Characteristic Function and Moment-generating 

Function of Random Variables 239 

6-11. Density Functions of the Sum of Two Random

Variables 242

CHAPTER 7 Normal Distributions and Limit Theorems

7-1. Bivariate Normal Considered as an Extension of

One-dimensional Normal Distribution 248

7-2. Multinormal Distribution 250 

7-3. Linear Combination of Normally Distributed Inde- 
pendent Random Variables 252 





7-4. Central-limit Theorem 254 

7-5. A Simple Random-walk Problem 258 

7-6. Approximation of the Binomial Distribution by the 

Normal Distribution 259 

7-7. Approximation of Poisson Distribution by a Normal 

Distribution .... 262 

7-8. The Laws of Large Numbers 263

CHAPTER 8 Continuous Channel without Memory

8-1. Definition of Different Entropies 267

8-2. The Nature of Mathematical Difficulties Involved . 269 

8-3. Infiniteness of Continuous Entropy 270 

8-4. The Variability of the Entropy in the Continuous Case

with Coordinate Systems 273 

8-5. A Measure of Information in the Continuous Case 275 

8-6. Maximization of the Entropy of a Continuous Random 

Variable 278 

8-7. Entropy Maximization Problems . 279 

8-8. Gaussian Noisy Channels .... ... 282 

8-9. Transmission of Information in Presence of Additive 

Noise 283 

8-10. Channel Capacity in Presence of Gaussian Additive 
Noise and Specified Transmitter and Noise Average 

Power 285 

8-11. Relation between the Entropies of Two Related 

Random Variables 287 

8-12. Note on the Definition of Mutual Information 289

CHAPTER 9 Transmission of Band-limited Signals

9-1. Introduction 292

9-2. Entropies of Continuous Multivariate Distributions . 293 

9-3. Mutual Information of Two Gaussian Random Vectors 295 

9-4. A Channel-capacity Theorem for Additive Gaussian 

Noise . . 297 

9-5. Digression 299 

9-6. Sampling Theorem 300

9-7. A Physical Interpretation of the Sampling Theorem 305

9-8. The Concept of a Vector Space 308 

9-9. Fourier-series Signal Space 313 

9-10. Band-limited Signal Space .... ... 315 

9-11. Band-limited Ensembles 317 

9-12. Entropies of Band-limited Ensemble in Signal Space 320





9-13. A Mathematical Model for Communication of 

Continuous Signals 322 

9-14. Optimal Decoding 323 

9-15. A Lower Bound for the Probability of Error 325 

9-16. An Upper Bound for the Probability of Error . 327 

9-17. Fundamental Theorem of Continuous Memoryless 

Channels in Presence of Additive Noise .... 329 

9-18. Thomasian's Estimate 330 


Part 3 : Schemes with Memory 

CHAPTER 10 Stochastic Processes 

10-1. Stochastic Theory 338 

10-2. Examples of a Stochastic Process 341 

10-3. Moments and Expectations 343 

10-4. Stationary Processes 344 

10-5. Ergodic Processes 347

10-6. Correlation Coefficients and Correlation Functions. . 349 

10-7. Example of a Normal Stochastic Process .... 352 

10-8. Examples of Computation of Correlation Functions . 353 

10-9. Some Elementary Properties of Correlation Functions of 

Stationary Processes 356 

10-10. Power Spectra and Correlation Functions .... 357 

10-11. Response of Linear Lumped Systems to Ergodic 

Excitation 359 

10-12. Stochastic Limits and Convergence 363 

10-13. Stochastic Differentiation and Integration .... 365 

10-14. Gaussian-process Example of a Stationary Process . 367 

10-15. The Over-all Mathematical Structure of the Stochastic 

Processes 368 

10-16. A Relation between Positive Definite Functions and

Theory of Probability 370

CHAPTER 11 Communication under Stochastic Regimes

11-1. Stochastic Nature of Communication 374

11-2. Finite Markov Chains 376

11-3. A Basic Theorem on Ergodic Markov Chains 377

11-4. Entropy of a Simple Markov Chain 380 

11-5. Entropy of a Discrete Stationary Source .... 384 

11-6. Discrete Channels with Finite Memory 388 

11-7. Connection of the Source and the Discrete Channel with

Memory 389 

11-8. Connection of a Stationary Source to a Stationary 

Channel 391 





Part 4: Some Recent Developments 

CHAPTER 12 The Fundamental Theorem of Information Theory 

Preliminaries 

12-1. A Decision Scheme 398 

12-2. The Probability of Error in a Decision Scheme . 398 

12-3. A Relation between Error Probability and Equivocation 400
12-4. The Extension of Discrete Memoryless Noisy Channels 402 

Feinstein’s Proof 

12-5. On Certain Random Variables Associated with a Com- 
munication System .... 403 

12-6. Feinstein’s Lemma 405

12-7. Completion of the Proof ... 406 

Shannon's Proof 

12-8. Ensemble Codes 409 

12-9. A Relation between Transinformation and Error 

Probability 412 

12-10. An Exponential Bound for Error Probability 414 

Wolfowitz's Proof

12-11. The Code Book 416 

12-12. A Lemma and Its Application 417 

12-13. Estimation of Bounds 419 

12-14. Completion of Wolfowitz's Proof 421

CHAPTER 13 Group Codes

13-1. Introduction 424

13-2. The Concept of a Group 425 

13-3. Fields and Rings 428 

13-4. Algebra for Binary n-Digit Words 429

13-5. Hamming’s Codes 431 

13-6. Group Codes 435 

13-7. A Detection Scheme for Group Codes 437 

13-8. Slepian’s Technique for Single-error Correcting Group 

Codes 438 

13-9. Further Notes on Group Codes 442

13-10. Some Bounds on the Number of Words in a Systematic 

Code ... 446 

APPENDIX Additional Notes and Tables 

N-1. The Gambler with a Private Wire 450 

N-2. Some Remarks on Sampling Theorem 452 

N-3. Analytic Signals and the Uncertainty Relation 454 





N-4. Elias’s Proof of the Fundamental Theorem for BSC . 457 

N-5. Further Remarks on Coding Theory 460 

N-6. Partial Ordering of Channels 462 

N-7. Information Theory and Radar Problems .... 464 

T-1. Normal Probability Integral 465

T-2. Normal Distributions 466 

T-3. A Summary of Some Common Probability Functions . 467 
T-4. Probability of No Error for Best Group Code ... 468 

T-5. Parity-check Rules for Best Group Alphabets . . . 469 

T-6. Logarithms to the Base 2 471 

T-7. Entropy of a Discrete Binary Source 476 

BIBLIOGRAPHY 481 

NAME INDEX 491 

SUBJECT INDEX 493 




CHAPTER 1 


INTRODUCTION 


Information theory is a new branch of probability theory with exten- 
sive potential applications to communication systems. Like several 
other branches of mathematics, information theory has a physical origin. 
It was initiated by communication scientists who were studying the 
statistical structure of electrical communication equipment. 

Our subject is about a decade old. It was principally originated by 
Claude Shannon through two outstanding contributions to the mathe- 
matical theory of communications in 1948 and 1949. These were fol- 
lowed by a flood of research papers speculating upon the possible applica- 
tions of the newly born theory to a broad spectrum of research areas, 
such as pure mathematics, radio, television, radar, psychology, semantics, 
economics, and biology. The immediate application of this new disci- 
pline to the fringe areas was rather premature. In fact, research in the 
past 5 or 6 years has indicated the necessity for deeper investigations into 
the foundations of the discipline itself. 

Despite this hasty generalization which produced several hundred 
research papers (with frequently unwarranted conclusions), one thing 
became evident. The new scientific discovery has stimulated the interest 
of thousands of scientists and engineers around the world. 

Our first task is to present a bird’s-eye view of the subject and to 
specify its place in the engineering curriculum. In this chapter a 
heuristic exposition of the topic is given. No effort is made to define 
the technical vocabulary. Such an undertaking requires a detailed logical 
presentation and is out of place in this informal introduction. However, 
the reader will find such material presented in a pedagogically prepared 
sequence beginning with Chap. 2. This introductory chapter discusses 
generalities, leaving a more detailed and precise treatment to subsequent 
chapters.* The specialist interested in more concrete statements may 
wish to forgo this introduction and begin with the body of the book.* 

* With the exception of Sec. 1-7, which gives a synopsis of information theory for
the specialist.

1-1. Communication Processes. Communication processes are concerned
with the flow of some sort of information-carrying commodity in
some network. The commodity need not be tangible; for example, the
process by which one mind affects another mind is a communication pro- 
cedure. This may be the sending of a message by telegraph, visual com- 
munication from artist to viewer, or any other means by which informa- 
tion is conveyed from a transmitter to a receiver. The subject matter 
deals with the gross aspects of communication models rather than with 
their minute structure. That is, we concentrate on the over-all per- 
formance of such systems without being restrained to any particular 
equipment or organ. Common to all communication processes is the 
flow of some commodity in some network. While the nature of the com- 
modity can be as varied as electricity, words, pictures, music, and art, 
one could suggest at least three essential parts of a communication system 

(Fig. 1-1): 

1. Transmitter or source 

2. Receiver or sink 

3. Channel or transmission network which conveys the communique 
from the transmitter to the receiver 


[Figure: block diagram of Transmitter, Channel, and Receiver connected in sequence]

Fig. 1-1. The model of a communication system.

This is the simplest communication system that one can visualize. 
Practical cases generally consist of a number of sources and receivers and 
a complex network. A familiar analogous example is an electric power 
system using several interconnected power plants to supply several towns. 

In such problems one is concerned with a study of the distribution 
of the commodity in the network, defining some sort of efficiency of 
transmission and hence devising schemes leading to the most efficient 
transmission. 

When the communique is tangible or readily measurable, the problems
encountered in the study of the communication system are of the types
somewhat familiar to engineers and operational analysts (for instance,
the study of an electric circuit or the production schedule of a manufacturing
plant). When the communique is “intelligence” or “information,”
this general familiarity cannot be assumed. How does one define
a measure for the amount of information? And having defined a suitable 
measure, how does one apply it to the betterment of the communication 
of information? 

To mention an analog, consider the case of an electric power network 
transmitting electric energy from a source to a receiver (Fig. 1-2). At 
the source the electric energy is produced with voltage Vs. The receiver
requires the electric energy at some prescribed voltage Vr. One of the
problems involved is the heat loss in the channel (transmission line).
In other words, the impedance of the wires acts as a parasitic receiver. 
One of the many tasks of the designer is to minimize the loss in the trans- 
mission lines. This can be accomplished partly by improving the quality 
of the transmission lines. A parallel method of transmission improve- 
ment is to increase the voltage at the input terminals of the line. As is 
well known, this improves the efficiency of transmission by reducing 
energy losses in the line. A step-up voltage transformer installed at 
the input terminals of the line is appropriate. At the output terminals 
another transformer (step-down) can provide the specified voltage to the 
receiver. 
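
The effect of raising the line voltage can be illustrated with a short calculation. The following is only a sketch under simplifying assumptions that are not in the text: a purely resistive line, current taken as power divided by voltage, and made-up values for power, voltage, and resistance.

```python
def line_loss_fraction(power_w, volts, line_resistance_ohm):
    """Fraction of the power entering the line that is lost as heat in it.

    Assumptions (illustrative only): purely resistive line,
    current = power / voltage, loss = current**2 * resistance.
    """
    current = power_w / volts
    return current ** 2 * line_resistance_ohm / power_w


# Hypothetical figures: 1 MW sent over a 5-ohm line at two voltages.
low_voltage_loss = line_loss_fraction(1e6, 10e3, 5.0)    # 10 kV
high_voltage_loss = line_loss_fraction(1e6, 100e3, 5.0)  # 100 kV

print(low_voltage_loss, high_voltage_loss)
```

Under these assumptions a tenfold increase in voltage cuts the current tenfold and the heat loss a hundredfold, which is the sense in which the step-up transformer improves the efficiency of transmission.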

Without being concerned about mathematical discipline in this intro- 
ductory chapter, let us ask if similar procedures could be applied to the 
transmission of information. If the channel of transmission of information
is a lossy one, can one still improve the efficiency of the transmission
by procedures similar to those in the above case? This of course depends,
in the first place, on whether a measure for the efficiency of transmission of
information can be defined.

Fig. 1-2. An example of a communication system.

1-2. A Model for a Communication System. The communication 
systems considered here are of a statistical nature. That is, the performance
of the system can never be described in a deterministic sense;
rather, it is always given in statistical terms. A source is a device that 
selects and transmits sequences of symbols from a given alphabet. Each 
selection is made at random, although this selection may be based on some 
statistical rule. The channel transmits the incoming symbols to the 
receiver. The performance of the channel is also based on laws of chance. 
If the source transmits a symbol, say A, with a probability of $P\{A\}$ and
the channel lets through the letter A with a probability denoted by
$P\{A \mid A\}$, then the probability of transmitting A and receiving A is

$$P\{A\} \cdot P\{A \mid A\}$$
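
The product rule just stated can be checked on a small numerical case. The two-letter alphabet and all probabilities below are hypothetical values chosen for illustration; they do not come from the text.

```python
# Hypothetical source statistics: P{symbol sent}.
p_source = {"A": 0.6, "B": 0.4}

# Hypothetical channel statistics: P{received | sent}.
p_channel = {
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.2, "B": 0.8},
}


def joint(sent, received):
    """P{sent and received} = P{sent} * P{received | sent}."""
    return p_source[sent] * p_channel[sent][received]


# Probability of transmitting A and receiving A.
p_aa = joint("A", "A")

# The joint probabilities over all (sent, received) pairs must sum to one.
total = sum(joint(s, r) for s in p_source for r in p_channel[s])

print(p_aa, total)
```

With these assumed numbers the probability of sending A and receiving A is 0.6 times 0.9, and the four joint probabilities together exhaust all possibilities.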

The communication channel is generally lossy; i.e., a part of the transmitted
commodity does not reach its destination or it reaches the destination
in a distorted form. There are often unwanted sources in a communication
channel, such as noise in radio and television or passage of a
vehicle in the opposite direction in a one-way street. These sources of
disturbance are generally referred to as noise sources or simply noise.
An important task of the designer is the minimization of the loss and the 
optimum recovery of the original commodity when it is corrupted by the 
effect of noise. 

[Figure: block diagram of Transmitter, Encoder, Channel, Decoder, and Receiver connected in sequence, with a Noise source feeding the Channel]

Fig. 1-3. General structure of a communication system used in information theory.

In the deterministic electrical model of Fig. 1-2, it was pointed out
that one device which may be used to improve the efficiency of the system
is called a transformer. In the vocabulary of information theory a device
that is used to improve the efficiency of the channel may be called an
encoder. An encoded message is less susceptible to channel noise. At
the receiver's terminal a decoder is employed to transform the encoded
messages into the original form which is acceptable to the receiver. It
could be said that, in a certain sense, for more “efficient” communication,
the encoder performs a one-to-one mathematical mapping or an operation
$F$ on the input commodity $I$, $F(I)$, while the decoder performs the inverse
of that operation,

$$\text{Encoder: } F\colon\ I \to F(I)$$

$$\text{Decoder: } F^{-1}\colon\ F(I) \to I$$
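
The idea of an encoder as a one-to-one mapping and a decoder as its inverse can be sketched in a few lines. The two-message codebook below is hypothetical, and the sketch deliberately ignores noise: it shows only that an invertible mapping lets the receiver recover the original message exactly.

```python
# Hypothetical codebook: each message maps to a distinct code word,
# so the mapping F is one-to-one and therefore has an inverse.
F = {"yes": "000", "no": "111"}
F_inverse = {word: message for message, word in F.items()}


def encode(message):
    """Apply the mapping F to the input message."""
    return F[message]


def decode(word):
    """Apply the inverse mapping to a received code word."""
    return F_inverse[word]


# In the absence of noise, decoding undoes encoding for every message.
round_trip_ok = all(decode(encode(m)) == m for m in F)
print(round_trip_ok)
```

If two messages shared a code word, the inverse dictionary would silently lose one of them, which is why the mapping is required to be one-to-one.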

This perfect procedure is, of course, hypothetical; one has to face the 
ultimate effect of noise which in physical systems will prevent perfect 
communication. This is clearly seen in the case of the transmission of 
electrical energy where the transformer decreases the heat loss but an 
efficiency of 100 per cent cannot be expected. The step-up transformer 
acts as a sort of encoder and the step-down transformer as a decoding 
apparatus. 

Thus, in any practical situation, we have to add at least three more 
basic parts to our mathematical model: source of noise, encoder, and 
decoder (Fig. 1-3). The model of Fig. 1-3 is of a general nature; it may
be applied to a variety of circumstances. 

A novel application of such a model was made by Wiener and Shannon 
in their discussions of the statistical nature of the communication of 
messages. It was pointed out that a radio, television, teletype, or speech
transmitter selects sequences of messages from a known transmitter
vocabulary at random but with specified probabilities. Therefore, in 
such communication models, the source, channel, encoder, decoder, noise 
source, and receiver must be statistically defined. This point of view in 
itself constitutes a significant contribution to the communication sciences. 
In light of this view, one comes to realize that a basic study of communica- 
tion systems requires some knowledge of probability theory. Communi- 
cation theories cannot be adequately studied without having a good 
background of probability. Conversely, readers acquainted with the 
fundamentals of probability theory can proceed most efficiently with 
research in the field of communication. 

In the macroscopic study of communication systems, some of the basic 
questions facing us are these: 

1. How does one measure information and define a suitable unit for
such measurements?

2. Having defined such a unit, how does one define an information
source, or how does one measure the rate at which an information source
supplies information?

3. What is the concept of channel? How does one define the rate at
which a channel transmits information?

4. Given a source and a channel, how does one study the joint rate of 
transmission of information and how does one go about improving that 
rate? How far can the rate be improved? 

5. To what extent does the presence of noise limit the rate of transmis- 
sion of information without limiting the communication reliability? 

To present systematic answers to these questions is our principal task. 
This is undertaken in the following chapters. However, for the benefit 
of those who wish to acquire a heuristic introduction to the subject, we 
include a brief discussion of it here. 

1-3. A Quantitative Measure of Information. In our study we deal 
with ideal mathematical models of communication. We confine our- 
selves to models that are statistically defined. That is, the most sig- 
nificant feature of our model is its unpredictability. The source, for 
instance, transmits at random any one of a set of prespecified messages. 
We have no specific knowledge as to which message will be transmitted 
next. But we know the probability of transmitting each message 
directly, or something to that effect. If the behavior of the model were 
predictable (deterministic), then recourse to measuring an amount of 
information would hardly be necessary. 

When the model is statistically defined, while we have no concrete 
assurance of its detailed performance, we are able to describe, in a sense, 
its “over-all” or “average” performance in the light of its statistical 
description. In short, our search for an amount of information is virtually 
a search for a statistical parameter associated with a probability 
scheme. The parameter should indicate a relative measure of uncer- 
tainty relevant to the occurrence of each particular message in the message 
ensemble. 

We shall illustrate how one goes about defining the amount of informa- 
tion by a well-known rudimentary example. Suppose that you are faced 
with the selection of equipment from a catalog which indicates n distinct 
models: 

$$[x_1, x_2, \ldots, x_n]$$

The desired amount of information $I(x_k)$ associated with the selection of a 
particular model $x_k$ must be a function of the probability of choosing $x_k$: 

$$I(x_k) = f(P\{x_k\}) \tag{1-2}$$

If, for simplicity, we assume that each one of these models is selected 
with an equal probability, then the desired amount of information is 
only a function of n. 

$$I_1(x_k) = f(P\{x_k\}) = f\!\left(\frac{1}{n}\right) \tag{1-2a}$$

Next assume that each piece of equipment listed in the catalog can be 
ordered in one of m distinct colors. If for simplicity we assume that the 
selection of colors is also equiprobable, then the amount of information 
associated with the selection of a color $c_j$ among all equiprobable colors 
$[c_1, c_2, \ldots, c_m]$ is 

$$I_2(c_j) = f(P\{c_j\}) = f\!\left(\frac{1}{m}\right) \tag{1-2b}$$

where the function f(x) must be the same unknown function used in 
Eq. (1-2a). 

Finally, assume that the selection is done in two ways: 

1. Select the equipment and then select the color, the two selections 
being independent of each other. 

2. Select the equipment and its color at the same time as one selection 
from mn possible equiprobable choices. 

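The search for the function f(x) is based on the intuitive choice which 
requires the equality of the amount of information associated with the 
selection of the model $x_k$ with color $c_j$ in both schemes (1-2c) and (1-2d). 

$$I(x_k \text{ and } c_j) = I_1(x_k) + I_2(c_j) = f\!\left(\frac{1}{n}\right) + f\!\left(\frac{1}{m}\right) \tag{1-2c}$$

$$I(x_k \text{ and } c_j) = f\!\left(\frac{1}{mn}\right) \tag{1-2d}$$

Thus 

$$f\!\left(\frac{1}{n}\right) + f\!\left(\frac{1}{m}\right) = f\!\left(\frac{1}{mn}\right) \tag{1-3}$$

This functional equation has several solutions, the most important of 
which, for our purpose, is 

$$f(x) = -\log x \tag{1-4}$$ *

To give a numerical example, let n = 18 and m = 8. 

$$\begin{aligned}
I_1(x_k) &= \log 18 \\
I_2(c_j) &= \log 8 \\
I(x_k \text{ and } c_j) &= I_1(x_k) + I_2(c_j) \\
I(x_k \text{ and } c_j) &= \log 18 + \log 8 = \log 144
\end{aligned}$$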


Thus, when a statistical experiment has n equiprobable outcomes, the 
average amount of information associated with an outcome is log n. The 
logarithmic information measure has the desirable property of additivity 
for independent statistical experiments. These ideas will be elaborated 
upon in Chap. 3. 
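
The additivity property can be checked numerically for the catalog example. The following is a minimal sketch, not part of the original text; the function name `self_information` and the choice of base 2 are assumptions made for illustration.

```python
import math


def self_information(probability, base=2.0):
    """Amount of information -log(p) for an outcome of probability p."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log(probability) / math.log(base)


n, m = 18, 8  # catalog models and colors from the numerical example

info_model = self_information(1 / n)        # log 18
info_color = self_information(1 / m)        # log 8
info_joint = self_information(1 / (m * n))  # log 144

# Two independent selections carry as much information as a single
# selection among mn equiprobable choices.
print(info_model + info_color, info_joint)
```

Any other base gives the same agreement, since changing the base only rescales all three quantities by a common factor.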

1-4. A Binary Unit of Information. The simplest case to consider is a 
selection between two equiprobable events $E_1$ and $E_2$. $E_1$ and $E_2$ may 
be, say, head or tail in a throwing of an “honest” coin. Following Eq. 
(1-4), the amount of information associated with the selection of one out 
of two equiprobable events is 

$$-\log \tfrac{1}{2} = \log 2$$

An arbitrary but convenient choice of the base of the logarithm is 2. In 
that case, $-\log_2 \tfrac{1}{2} = 1$ provides a unit of information. This unit 
is commonly known as a bit.† 

Fig. 1-4. A probability space with two equiprobable events. 


* Another solution is 

f(x) = number of factors in the decomposition of x into a product of primes, taken with a minus sign 

For example, let n = 18, m = 8; then 

n = 2 · 3 · 3 
m = 2 · 2 · 2 
mn = 2 · 2 · 2 · 2 · 3 · 3 

In information theory we require that f(x) be a decreasing function of the probability 
of choices. This narrows down the solution of Eq. (1-4) to k log x, where k is a con- 
stant multiplier. [For an axiomatic derivation, see Sec. 3-19 or A. Feinstein (I).] 

† When the logarithm is taken to the base 10, the unit of information corresponds 
to the selection of one out of ten equiprobable cases. This unit is sometimes referred 
to as a Hartley since it was suggested by Hartley in 1928. When the natural base is 
used, the unit of information is abbreviated as nat. 




Next consider the selection of one out of $2^2, 2^3, 2^4, \ldots, 2^N$ equally 
likely choices. By successively partitioning a selection into two equally 
likely selections, we come to the conclusion that the amounts of informa- 
tion associated with the previous selection schemes are, respectively, 
2, 3, 4, . . . , N bits. 
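
As a small illustration of how the unit depends on the base of the logarithm, the sketch below tabulates the information of one selection among equally likely choices in bits (base 2), Hartleys (base 10), and nats (natural base). The helper name `information_of_choice` is an assumption introduced here for illustration only.

```python
import math


def information_of_choice(n_choices, base):
    """Information log_base(n) for one of n equally likely choices."""
    return math.log(n_choices) / math.log(base)


for exponent in range(1, 5):
    n_choices = 2 ** exponent
    bits = information_of_choice(n_choices, 2)
    hartleys = information_of_choice(n_choices, 10)
    nats = information_of_choice(n_choices, math.e)
    print(f"{n_choices:2d} choices: {bits:.3f} bits, "
          f"{hartleys:.3f} Hartleys, {nats:.3f} nats")
```

The three columns differ only by constant factors, so the choice of unit is a matter of convenience.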

In a slightly more general case, consider a source with a finite number 
of messages and their corresponding transmission probabilities. 

$$[x_1, x_2, \ldots, x_n]$$
$$[P\{x_1\}, P\{x_2\}, \ldots, P\{x_n\}]$$

The source selects at random each one of these messages. Successive 
selections are assumed to be statistically independent. The probability 
associated with the selection of message $x_k$ is $P\{x_k\}$. The amount 
of information associated with the transmission of message $x_k$ is 
defined as 

$$I_k = -\log P\{x_k\}$$

$I_k$ is also called the amount of self-information of the message $x_k$. The 
average information per message for the source is 

$$I = \text{statistical average of } I_k = -\sum_{k=1}^{n} P\{x_k\} \log P\{x_k\} \tag{1-5}$$

Fig. 1-5. Successive partitioning of the probability space. 

For instance, the amount of information associated with a source of the 
above type, transmitting two symbols 0 and 1 with equal probability, is 

$$I = -\left(\tfrac{1}{2} \log \tfrac{1}{2} + \tfrac{1}{2} \log \tfrac{1}{2}\right) = 1 \text{ bit}$$

If the two symbols were transmitted with probabilities $\alpha$ and $1 - \alpha$, 
then the average amount of information per symbol becomes 

$$I = -\alpha \log \alpha - (1 - \alpha) \log (1 - \alpha) \tag{1-6}$$

The average information per message I is also referred to as the entropy 
(or the communication entropy) of the source and is usually denoted by 
the letter H. For instance, the entropy of a simple source of the above 
type is 

$$H(p_1, p_2, \ldots, p_n) = -(p_1 \log p_1 + p_2 \log p_2 + \cdots + p_n \log p_n)$$




where $(p_1, p_2, \ldots, p_n)$ refers to a discrete complete probability scheme. 
Figure 1-6 shows the entropy of a simple binary source for different 
message probabilities. 
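
The entropy of a finite scheme is easy to evaluate numerically. The sketch below is an illustration under stated assumptions rather than a definitive implementation: the function names `entropy` and `binary_entropy` are introduced here, logarithms are taken to base 2, and terms with zero probability are taken to contribute nothing. It tabulates the entropy of a binary source for a few message probabilities.

```python
import math


def entropy(probabilities, base=2.0):
    """Entropy -sum(p log p) of a discrete complete probability scheme."""
    if any(p < 0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probabilities), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # A term with p = 0 is skipped, since p log p tends to 0 as p tends to 0.
    return sum(-p * math.log(p) for p in probabilities if p > 0) / math.log(base)


def binary_entropy(p):
    """Entropy in bits of a source sending two symbols with probabilities p, 1 - p."""
    return entropy([p, 1.0 - p])


for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p = {p:4.2f}   H = {binary_entropy(p):.4f} bits")
```

The values are symmetric about p = 1/2, where the entropy reaches its largest value of 1 bit, and fall to zero when one of the two symbols is certain.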

Next, consider a second similar source having m symbols, and designate 
the amount of information per symbol of the two sources by H(n) and 
H(m), respectively. If the two sources transmit their symbols inde- 
pendently, their joint output might be considered as a source having mn 
distinct pairs of symbols. It can be shown that for two such independent 
sources the average information per joint symbol is 

$$H(mn) = H(m) + H(n)$$

The formal derivation of this relation is given in Chap. 3. 
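
Pending the formal derivation, the relation can be checked numerically. In the sketch below the two probability schemes are illustrative values chosen here, not data from the text; the joint scheme is built by multiplying probabilities, which is exactly the independence assumption.

```python
import math
from itertools import product


def entropy(probabilities):
    """Entropy in bits; zero-probability terms contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


source_n = [0.7, 0.2, 0.1]  # illustrative scheme with n = 3 symbols
source_m = [0.5, 0.5]       # illustrative scheme with m = 2 symbols

# Independent sources: each of the mn pairs has probability p * q.
joint = [p * q for p, q in product(source_n, source_m)]

h_n = entropy(source_n)
h_m = entropy(source_m)
h_joint = entropy(joint)

print(f"H(n) + H(m) = {h_n + h_m:.6f} bits, H(mn) = {h_joint:.6f} bits")
```

If the pairs were not formed by multiplying probabilities, that is, if the sources were statistically related, the two printed values would in general differ.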

1-5. Sketch of the Plan. From a mathematical point of view, the 
heuristic exposition of the previous two sections is somewhat incomplete. 
We still need to formalize our understanding of the basic concepts 
involved and to develop techniques for studying more complex physical 
models. It was suggested that, given an independent source S which 
transmits messages $x_k$ from a finite set 

$$[x_1, x_2, \ldots, x_n]$$
$$[P\{x_1\}, P\{x_2\}, \ldots, P\{x_n\}]$$

there is an average amount of information I(X) associated with the inde- 
pendent source S. 

I(X) = expected value or average of $I(x_k)$ for all messages 

Our next step is to generalize this to the case of random variables with 
two or more not necessarily statistically independent dimensions, for 
instance, to define the amount of information per symbol of a scheme 
having pairs of statistically related symbols $(x_k, y_k)$. This investigation 
in turn will lead to the study of a channel driven by the source supplying 
information to that channel. It will be shown that the average informa- 
tion for such a system is 

$$\text{Expected value of } I(x_k, y_k) = I(X;Y) \tag{1-7}$$ *

* This consideration will be further generalized to the study of the transinfor- 
mation in discrete and continuous channels. We shall study the upper bound for 
transinformation under a variety of plausible circumstances. Previously we men- 
tioned our use of many undefined technical terms such as event, probability of an 
event, discrete random variable, expected value, source, statistical independence, 
channel, code, encoding and decoding, channel capacity, etc. These terms as well as 
many other technical terms will be clearly defined as they are introduced later in the 
book. 

Fig. 1-6. The entropy of an independent discrete memoryless binary source. 


From a physical point of view, the above model may be viewed in a 
simpler fashion. Consider a source transmitting any one of the two 
messages $x_1$ and $x_2$ with respective probabilities of $\alpha$ and $1 - \alpha$. The 
output of this source is communicated to a receiver via a noisy binary 
channel. The channel is described by a stochastic matrix: 

$$\begin{bmatrix} a & 1 - a \\ 1 - b & b \end{bmatrix}$$

When $x_1$ is transmitted, the probability of a correct reception is a and 
otherwise 1 - a. Similarly, when $x_2$ is transmitted, the probabilities of 
correct and incorrect receptions are b and 1 - b, respectively. 

It will be shown (Chap. 3) that there is an average amount of informa- 
tion I(X;Y) associated with this model which exhibits the rate of the 
information transmitted over the channel. This, in turn, raises a basic 
question. Given such a channel, what is the highest possible rate of 
transmission of information over this channel for a specified class of 
sources? In this manner, one arrives in a natural way at the concept of 
channel capacity and efficiency of a statistical communication model. 

In the above example, the capacity of the channel may be computed by 
maximizing the information measure I(X;Y) over all permissible values 
of $\alpha$. 
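
The maximization can be sketched numerically. The code below is an illustration under stated assumptions, not the formal treatment of Chap. 3: it uses the identity I(X;Y) = H(Y) - H(Y|X), writes the probability of sending the first message as `alpha` (to keep it apart from the channel entry a), searches over a uniform grid of source probabilities, and uses illustrative channel values a = b = 0.9.

```python
import math


def h(probabilities):
    """Entropy in bits; zero-probability terms contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


def transinformation(alpha, a, b):
    """I(X;Y) in bits for source (alpha, 1 - alpha) and channel [[a, 1-a], [1-b, b]]."""
    p_y1 = alpha * a + (1 - alpha) * (1 - b)  # probability of receiving y1
    h_y = h([p_y1, 1 - p_y1])
    h_y_given_x = alpha * h([a, 1 - a]) + (1 - alpha) * h([b, 1 - b])
    return h_y - h_y_given_x


def capacity_by_grid(a, b, steps=10_000):
    """Largest I(X;Y) found on a uniform grid of source probabilities."""
    best_alpha, best_rate = 0.0, 0.0
    for i in range(steps + 1):
        alpha = i / steps
        rate = transinformation(alpha, a, b)
        if rate > best_rate:
            best_alpha, best_rate = alpha, rate
    return best_alpha, best_rate


alpha_star, capacity = capacity_by_grid(a=0.9, b=0.9)
print(f"maximizing alpha = {alpha_star:.4f}, capacity = {capacity:.4f} bits")
```

For a channel with a = b the maximum falls at equal source probabilities, as symmetry suggests; a grid search is a crude method chosen here only for transparency.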

In short, with each probability scheme we associate an entropy which 
represents, in a way, the average amount of information for the outcomes 
of the scheme. When a source and a receiver are connected via a channel, 
several probability schemes such as conditional and joint probabilities 
have special significance. An important task is to investigate the physi- 
cal significance and the interrelationships between different entropies 
in a communication system. The formal treatment of these relations 
and the concept of channel capacity is presented in several chapters of 
the text. 

The reader acquainted with probability theory may regard information 
theory as a new branch of that discipline. He can grasp it at a fair speed. 
The reader without such a background has to move much more slowly. 
However he will find the introductory material of Chap. 2 of substantial 
assistance in the study of Chaps. 3 and 4. An introductory treatment of 
a random variable assuming a continuum of values is given in Chap. 5. 
Chapter 6 presents a general study of averaging and moments. The 
reader with such a background will readily recognize the entropy func- 
tions that form the nucleus of information theory as moments of an 
associated logarithmic random variable: $-\log P\{X\}$. Thus the entropy 
appears to be a new and useful form of moment associated with a proba- 
bility scheme. This idea will serve as an important link in the integration 
of information and probability theories. Chapter 7 gives a concise 
introduction to multinormal distributions, laws of large numbers, and 
central-limit theorems. These are essential tools for the proof of the 
main theorems of information theory. 

Chapters 8 and 9 extend the information-theory concept to random 
variables assuming a continuum of values (also continuous signals). 
The probability background of Chaps. 2, 5, 6, 7, and 10 is in most part 
indispensable for the study of information theory. However, a few 
additional topics are included for the sake of completeness, although they 
may not be directed toward an immediate application. 

Chapter 10 presents a bird's-eye view of stochastic theory, followed 
by Chap. 11, which studies the information theory of stochastic models. 
A slightly more advanced consideration (but perhaps the heart of the 
subject) appears in Chap. 12. 

A main application of the theory thus far seems to be in the devising 
of an efficient matching of the information source and the channel, the 
so-called coding theory. The elements of this theory appear in Chaps. 4 
and 13. The Appendix is designed to introduce the reader to a few of 
the many topics available for further reading in this field. 

1-6. Main Contributors to Information Theory. The historical back- 
ground of information theory cannot be covered in a few pages. For- 
tunately there are several sources where the reader can find a historical 
review of this subject, e.g., The Communication of Information, by E. C. 
Cherry (Am. Scientist, October, 1952), and “On Human Communica- 
tion,” by the same author (John Wiley & Sons, Inc., 1957). (In Chap. 
2 of the latter book, Cherry gives a very interesting historical account of 
developments leading to the discovery of information theory, particularly 
the impact of the invention of telecommunication.) 

As far as the communication engineering profession is concerned, it 
seems that the first attempt to define a measure for “the amount of 
information” was made by R. V. L. Hartley* in a paper called Transmission 
of Information (Bell System Tech. J., vol. 7, pp. 535-564, 1928). 

Hartley suggested that “information” arises from the successive selec- 
tion of symbols or words from a given vocabulary. From an alphabet 
of D distinct symbols we can select $D^N$ different words, each word con- 
taining N symbols. If these words were all equiprobable and we had to 
select one of them at random, there would be a quantity of information I 
associated with such a selection. Furthermore, Hartley suggested the 
logarithm to the base 10 of the number of possible different words as 
the quantity of information I = N log D. 

* The work of Hartley was greatly influenced by a law which was simultaneously 
discovered in 1924 by Nyquist in the United States and Kupfmuller in Germany. 
The Nyquist-Kupfmuller law states that for transmitting telegraph signals at a given 
rate a definite-frequency bandwidth is required. Among those who contributed to the 
refinement of this law, which has been closely interwoven with the concept of a 
measure of information, are D. Gabor (1946) and D. M. Mackay (1948). 
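
Hartley's measure can be illustrated with a short computation. The alphabet size and word length below are illustrative values chosen here, and the function name `hartley_information` is an assumption; the sketch only confirms that N log D equals the logarithm of the number of possible words.

```python
import math


def hartley_information(alphabet_size, word_length):
    """Hartley's measure I = N log10(D) for one of D**N equiprobable words."""
    return word_length * math.log10(alphabet_size)


D, N = 26, 5           # illustrative alphabet size and word length
n_words = D ** N       # number of different N-symbol words

print(f"{n_words} words, I = {hartley_information(D, N):.4f} (base-10 units)")
print(f"log10 of the word count = {math.log10(n_words):.4f}")
```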

The main contributions, which really gave birth to the so-called 
information theory, came shortly after the Second World War from 
the mathematicians C. E. Shannon and N. Wiener. Wiener's mathe- 
matical contributions to the field of Fourier series and later to time series, 
plus his genuine interest in the field of communication, led to the foun- 
dation of communication theories in general. His two books, “Cyber- 
netics” and “Extrapolation, Interpolation, and Smoothing of Stationary 
Time Series” (1948 and 1949), paved the way for the arrival of new 
statistical theories of communication. In a paper entitled The Mathe- 
matical Theory of Communication (Bell System Tech. J., vol. 27, 1948), 
Shannon made the first integrated mathematical attempt to deal with 
the new concept of the amount of information and its main consequences. 
Shannon's first paper, along with a second paper, laid the foundation for 
the new science to be named information theory. Shannon's earlier 
contribution may be summarized as follows: 

1. Definition of the amount of information from a semiaxiomatic 
point of view. 

2. Study of the flow of information for discrete messages in channels 
with and without noise (models of Figs. 1-1 and 1-3). 

3. Defining the capacity of a channel, that is, the highest rate of trans- 
mission of information for a channel with or without noise. 

4. In the light of 1, 2, and 3, Shannon gave some fundamental encod- 
ing theorems. These theorems state roughly that for a given source 
and a given channel one can always devise an encoding procedure leading 
to the highest possible rate of transmission of information. 

5. Study of the flow of information for continuous signals in the pres- 
ence of noise, as a logical extension of the discrete case. 

Subsequent to his earlier work, Shannon has made several additional 
contributions. These have considerably strengthened the position of the 
original theory. 

Following Wiener's and Shannon's works an unusually large number of 
scientific papers appeared in the literature in a relatively short time. A 
bibliography of information theory and allied topics might now, 13 years 
after the publication of Shannon's and Wiener's works, contain close to 
1,000 papers. This indicates the great interest and enthusiasm (per- 
haps overenthusiasm) of scientists toward this fascinating new discipline. 
Here it would be impossible to give a detailed account of the contribu- 
tions in this field. The reader may refer to A Bibliography of Informa- 
tion Theory, by F. L. Stumpers, and also to IRE Transactions on Infor- 
mation Theory (vol. IT-1, no. 3, pp. 31-47, September, 1955). 

Even though a historical account has not been attempted here, the 
names of some of the contributors should be mentioned in passing. Bell 
Telephone Laboratories appears to be the birthplace of information and 
coding theory. Among the contributors from Bell Labs are E. N. Gilbert, 
R. W. Hamming, J. L. Kelly, Jr., B. McMillan, S. O. Rice, C. E. 
Shannon, and D. Slepian. P. Elias, R. M. Fano, A. Feinstein, D. 
Huffman, C. E. Shannon, N. Wiener, and J. A. Wozencraft of the 
Massachusetts Institute of Technology have greatly contributed to the 
advancement of information and coding theory. Information theory 
has received significant stimuli from the works of several Russian mathe- 
maticians. A. I. Khinchin, by employing the results of McMillan and 
Feinstein, produced one of the first mathematically exact presentations 
of the theory. Academician A. N. Kolmogorov, a leading man in the 
field of probability, and his colleagues have made several important con- 
tributions. A few of the other Russian contributors are R. L. Dobrushin, 
p. A. Fadiev, M. A. Gavrilov, I. M. Gel'fand, A. A. Kharkevich, V. A. 
Kotelnikov,* M. Rozenblat-Rot, V. I. Siforov, and I. M. and A. M. Iaglom. 

The afore-mentioned names are only a few of a long list of mathema- 
ticians and communication scientists who have contributed to information 
theory. Some other familiar names are D. A. Bell, A. Blanc-Lapierre, 
L. Brillouin, N. Abramson, D. Gabor, S. Goldman, I. J. Good, N. K. 
Ignatyev, J. Loeb, B. Mandelbrot, K. A. Meshkovski, W. Meyer-Eppler, 
F. L. Stumpers, M. P. Schutzenberger, A. Perez, W. Peterson, A. 
Thomasian, R. R. Varsamov, J. A. Ville, P. M. Woodward. 

A list of those actively engaged in the field would be too long to be 
included here. Reference to some of the current work will be found in 
the text and in the bibliography at the end of the book. 

For a comprehensive list, the reader is referred to existing bibliog- 
raphies such as those by Stumpers, Green, and Cherry. Recent con- 
tributions to information theory have been aimed at providing more exact 
proofs for the basic theorems stated by earlier contributors. A state of 
steady improvement has been prevailing in the literature. 

McMillan, Feinstein, and Khinchin have greatly enhanced the elegance 
of the theory by putting it on a more elaborate mathematical basis and 
providing proofs for the central theorems as earlier stated by C. E. 
Shannon. These contributors have confirmed that under very general 
circumstances, it is possible to transmit information with a high degree 
of reliability over a noisy channel at a rate as close to the channel capacity 
as desired. 

J. Wolfowitz derived a strong converse of the fundamental theorem of 
information theory. Among other important theorems, he proved that 
reliable transmission at a rate higher than the channel capacity is not pos- 
sible. In the past 2 or 3 years a large number of scientists have become 
interested in integrating some of the work on encoding theory within the 
framework of classical mathematics. Reference will be made to their 
work in Chaps. 13 and 14. 

* Kotelnikov is known particularly for the development of the theory of potential 
noise immunity in the presence of white gaussian noise. 

S. Kullback has described the growth of information theory from its 
statistical roots and emphasized the interrelation between information 
theory and statistics (Kullback). 

The study of time-varying channels has also received considerable 
attention. Among those who have contributed are C. E. Shannon, R. A. 
Silverman and S. H. Chang, and V. I. Siforov and his colleagues. 

To sum up, the present trend in information theory seems to be as 
follows: From an engineering point of view, a search for applications of 
the theory (radar detection, speech, telephone and radio communication, 
game and decision theory, and particularly implementation of codes) is 
evident, while the mathematician is still seeking for more rigor in the 
foundation of the theory and elegance par excellence. 

1-7. An Outline of Information Theory. If we were to make a two- 
page résumé of information theory for those scientists with a broad back- 
ground of probability theory, the following could be suggested. 

1. The average amount of information conveyed by a discrete random 
variable Y about another discrete random variable X is suggested by 
C. E. Shannon. 

$$I(X;Y) = \sum_{i=1}^{n} \sum_{k=1}^{m} P\{X = x_i, Y = y_k\} \log \frac{P\{X = x_i, Y = y_k\}}{P\{X = x_i\}\, P\{Y = y_k\}} \tag{1-8}$$

This definition can be generalized to cover not only the case of two or 
more random variables assuming a continuum of values but also the more 
general case of random vectors, generalized functions, and stochastic 
processes (Gel'fand and Iaglom*). 

2. The channel is specified by $P\{Y = y_k \mid X = x_i\}$ for all encountered 
integers i and k. The largest value of the transinformation I(X;Y) 
obtained over all possible source distributions $P\{X = x_i\}$ is called the 
capacity of the channel [Shannon (I)]. 

The definition of the channel capacity can be subjected to generaliza- 
tions similar to those suggested in 1. 

3. Let X and Y be two finite sets of alphabets with $x \in X$ and $y \in Y$. 
The simplest channel is specified by $P\{\,\cdot \mid x \in X\}$. Now consider words 
of n symbols selected from the X alphabet. These words will be denoted 
by $u \in U$ and their corresponding received pairs by $v \in V$. This is an 
nth-order extension of the channel. 

4. Given a source $P\{X = x_k\}$, a channel $P\{\,\cdot \mid X = x_k\}$, and their 
respective nth-order extensions, then to a specified message ensemble U, we 
may associate a partitioning of the V space such that 

$$\begin{aligned}
u_k &\to B_k & k &= 1, 2, \ldots, N \\
B_k \cap B_j &= \varnothing \quad \text{for } k \neq j & k &= 1, 2, \ldots, N \\
P\{B_k \mid u_k\} &> 1 - \lambda & k &= 1, 2, \ldots, N
\end{aligned}$$

where $\lambda$ is a specified positive number, usually very small. 

This is a decision scheme which in turn specifies a code $(N, n, \lambda)$ [A. 
Feinstein (I)]. 

* I. M. Gel'fand and A. M. Iaglom, Calculation of the Amount of Information 
about a Random Function Contained in Another Such Function, Uspekhi Mat. 
Nauk S.S.S.R. (N.S.), vol. 12, no. 1(73), pp. 3-52, 1957; or Am. Math. Soc. Transl., 
ser. 2, vol. 12, 1959. 

5. The central theme of information theory is the following so-called 
fundamental coding theorem. Given a source, a channel with capacity C, 
and two constants 

$$0 < H < C \qquad 0 < \lambda < 1$$

it can be shown that there are an integer $n = n(\lambda, H)$ and a code $(N, n, \lambda)$ 
with (N = function of $\lambda$ and n) $> 2^{nH}$. This is the coding theorem stating 
the possibility of transmitting information at a rate H < C over a noisy 
channel under specified circumstances. 

6. Further elaborate mathematical treatment of the concepts of infor- 
mation theory was presented by B. McMillan, who extended the defini- 
tion of source and channels from a Markov chain to stationary processes. 
A proof of the fundamental theorem of 5 as well as a clear understand- 
ing of the concepts involved in 5 is due to Feinstein. Khinchin con- 
siderably improved the status of the art in general and gave a proof of 
the fundamental theorem of 5 for the case of stationary processes. A 
converse of the fundamental theorem of 5 is due to J. Wolfowitz, who also 
gave sharper estimates than those given in 5. Remaining questions 
include the search for more general encoding theorems along the lines 
suggested in 1. A recent step in this direction was taken by C. E. 
Shannon.* The search for engineering applications, particularly low- 
error probability codes, is ever increasing. 
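
Item 1 of the outline can be made concrete with a short computation. The sketch below assumes the transinformation is written in its standard form, with the logarithm taken of the ratio of each joint probability to the product of the corresponding marginal probabilities; the function name and the two joint tables are illustrative choices made here, not data from the text.

```python
import math


def transinformation(joint):
    """I(X;Y) in bits from a table joint[i][k] = P{X = x_i, Y = y_k}."""
    p_x = [sum(row) for row in joint]             # marginal distribution of X
    p_y = [sum(column) for column in zip(*joint)]  # marginal distribution of Y
    total = 0.0
    for i, row in enumerate(joint):
        for k, p_xy in enumerate(row):
            if p_xy > 0:  # zero-probability pairs contribute nothing
                total += p_xy * math.log2(p_xy / (p_x[i] * p_y[k]))
    return total


independent = [[0.25, 0.25], [0.25, 0.25]]  # Y tells nothing about X
noiseless = [[0.5, 0.0], [0.0, 0.5]]        # Y determines X exactly

print(transinformation(independent), transinformation(noiseless))
```

The two tables mark the extremes for an equiprobable binary source: no information is conveyed when X and Y are independent, and one full bit per symbol when the channel is noiseless.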


* C. E. Shannon, Probability of Error for Optimal Codes in a Gaussian Channel, 
Bell System Tech. J., vol. 38, no. 3, pp. 611-656, 1959. 


PROBLEMS 

1-1. An alphabet consists of four letters A, B, C, D with respective probabilities of 
transmission %. Find the average amount of information associated with 
the transmission of a letter. 



1-2. An independent, discrete source transmits letters selected from an alphabet 
consisting of three letters A, B, and C, with respective probabilities 

$$p_A = 0.7 \qquad p_B = 0.2 \qquad p_C = 0.1$$

(a) Find the average entropy per letter. 

(b) If consecutive letters are statistically independent and two-symbol words are 
transmitted, find all the pertinent probabilities for all two-letter words and the 
entropy of the system of such words. 

1-3. Plot the curve $y = -x \log_2 x$ for 

$$0 < x < 1$$

1-4. A pair of dice are thrown. We are told that the sum of the faces is 7. What 
is the average amount of information contained in this message (that is, the entropy 
associated with the probability scheme of having the sum of the faces equal to 7, 8, 
. . . , 12)? 

1-5. An alphabet consists of six symbols A, B, C, D, E, and F which are transmitted 
with the probabilities indicated below: 

Symbol   Code      Probability 
A        0         (illegible) 
B        01        (illegible) 
C        011       (illegible) 
D        0111      1/16 
E        01111     1/32 
F        011111    (illegible) 

(a) Find the average information content per letter. 

(b) If the letters are encoded in a binary system as shown above, find P{1} and 
P{0} and the entropy of the binary source. 

1-6. A bag contains 100 white balls, 50 black balls, and 50 blue balls. Another bag 
contains 80 white balls, 80 black balls, and 40 blue balls. Determine the average 
amount of information associated with the experiment of drawing a ball from each 
bag and predicting its color. The result of which experiment is, on the average, 
harder to predict? 

1-7. There are 12 coins, all of equal weight except one, which may be lighter or 
heavier. Using information-theory concepts, show that it is possible to determine 
which coin is the odd one and indicate whether it is lighter or heavier in not more than 
three weighings with an ordinary balance. 

1-8. Solve Prob. 1-7 when the number of coins is N. What is the minimum num- 
ber of weighings? 

1-9. There are seven coins, five of equal weight and the remaining two also of 
equal weight but lighter than the first five coins. Find the minimum number of 
weighings necessary to locate these two coins.* 

* For a general discussion of coin-weighing problems, see A. M. Iaglom and 
I. M. Iaglom, “Probabilité et information” (translated from Russian), Dunod, 
Paris, 1959. 



PART 1 


DISCRETE SCHEMES WITHOUT MEMORY 


. . . choose a set of symbols, endow them with certain properties and 
postulate certain relationships between them. Next, . . . deduce further 
relationships between them. . . . We can apply this theory if we know 
the exact “physical significance” of the symbols. That is, if we can find 
objects in nature which possess exactly those properties and inter-relations 
with which we endowed the symbols. . . . The “pure” mathematician 
is interested only in the inter-relations between the symbols. . . . The 
“applied” mathematician always has the problem of deciding what is the 
exact physical significance of the symbols. If this is known, then at any 
stage in the theory we know the physical significance of our theorems. But 
the strength of the chain depends on the strength of the weakest link, and 
on occasion the link of “physical significance” is exceedingly fragile. 

J. E. Kerrich, “An Experimental Introduction to the Theory of Probability” 

Belgisk Import Co., Copenhagen 




CHAPTER 2 


BASIC CONCEPTS OF DISCRETE PROBABILITY 


2-1. Intuitive Background. Most of us have some elementary intui- 
tive notions about the laws of probability, and we may set up a game or an 
experiment to test the validity of these notions. This procedure is 
much like the so-called classical approach to the theory of probability, 
which was commonly used by mathematicians up to the 1930s. How- 
ever, this approach has been subjected to considerable criticism; indeed, 
the literature on the subject contains many contradictions and contro- 
versies in the writings of the major authors. These arise from the 
intuitive background used and the lack of well-defined formalism and 
rigor. Thus, the experiment or game is usually defined by assuming 
certain symmetries and by accepting certain results a priori, such as the 
idea that certain possible outcomes are equally likely to occur. For 
example, consider the following problem: Two persons, A and B, play a 
game of tossing a coin. The coin is thrown twice. If a head appears in 
at least one of the two throws, A wins. Otherwise, B wins. Intuitively, 
it seems that the four following possible outcomes are equally probable ; 

(HH), (HT), (TH), (TT) 

where H denotes head and T denotes tail. A may assume that his 
chances of winning the game are 3/4, since a head occurs in three out of 
four cases (to his advantage). On the other hand, the following reason- 
ing may also seem logical. If the outcome of the first throw is H, A 
wins; there is no need to continue the game. Accordingly, only three 
possibilities need be considered, namely: 

(H), (TH), and (TT) 

where the first two cases are favorable to A and the last one to B. In 
other words, the probability that A wins is really 2/3 instead of 3/4. The 
intuitive approach in this problem thus seems to lead to two different 
estimates of probability.* 
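
The two lines of reasoning can be compared by direct enumeration. The sketch below assumes an honest coin and independent throws, which is an assumption made here for illustration; exact fractions are used so that no rounding enters.

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)  # assumption: honest coin, independent throws

# Full game: the four outcomes of two throws, each of probability 1/4.
outcomes = list(product("HT", repeat=2))
p_a_full = sum(half * half for outcome in outcomes if "H" in outcome)

# Shortened game: stop as soon as a head appears. The three possibilities
# (H), (TH), (TT) are NOT equally likely under the assumption above.
shortened = {("H",): half, ("T", "H"): half * half, ("T", "T"): half * half}
p_a_short = sum(p for outcome, p in shortened.items() if "H" in outcome)

print(p_a_full, p_a_short)
```

Under this assumption both descriptions of the game give A the same chance of winning; the discrepancy arises only when the three possibilities of the shortened game are treated as equally likely.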

* The reasoning which assumes the equiprobable outcomes is incorrect. See also 
Prob. 2-38. 

The twentieth century has witnessed enormous advances in the rigor- 
ous axiomatic treatment of many branches of mathematics. It is true 
that the axiomatic approach is essentially present in the familiar euclidean 
geometry and is, in a way, a very old principle. But it was not until the 
early twentieth century, when the formal and logical structure of mathe- 
matics was given serious, systematic study, that its fundamental and 
profound implications were recognized. Actually, however, the ground- 
work for the axiomatic treatment was laid by mathematicians such as 
Peano, Cantor, and Boole during the middle of the nineteenth century. 
The later efforts of Hilbert, Russell, Whitehead, and others led to a com- 
plete reorientation of the basic formulations, bringing mathematics to its 
present level. 

Although consideration of the axiomatic treatment is not our subject 
here, it may be interesting to point out its general nature. First, a 
necessary set of symbols is introduced. Then certain inference or oper- 
ation rules are given for the desired formal manipulation of the symbols, 
and a proper set of axioms is determined. The formal system thus 
created must be consistent; that is, the axioms must be independent and 
noncontradictory. Strictly speaking, the derivation of the theorems is 
manipulation of symbols without content, using axioms as a starting 
point and applying the rules of operation. The fundamental nature of a 
formal system is by no means obvious, and the limitations are even today 
under very careful study. 

A rather new branch of mathematics exists which deals in an axiomatic 
manner with properties of various abstract spaces and functions defined 
over these spaces. This is the so-called “measure theory.” In the late 
1930s and early 1940s attempts were made to put the probability calculus 
on an axiomatic basis. The work of Kolmogorov, Doob, and many others 
has contributed greatly toward this aim. Today formal probability 
theory is an important branch of measure theory (in a strictly formal 
sense), although the epistemological meaning of probability itself is 
subject to philosophical discussion. This latter aspect has been studied 
by several profound thinkers (von Neumann, Carnap, Russell, Fisher, 
Neyman, and many others). 

Today engineers and research scientists recognize that they must have 
a working knowledge of the powerful tools of twentieth-century mathe- 
matics. Although completely axiomatic and rigorous treatment of this 
subject is far beyond the scope of this discussion, a classical presentation 
would be out of date, as it would completely forgo the important modern 
contributions to the theory. Under these circumstances, it seems that a 
survey of the modern theory of probability at a nonprofessional level will 
be a reasonable compromise. Most engineering students are not very 
familiar with concepts of probability, and it is important that they gain 
some appreciation of them. 

In what follows, some elementary concepts of the theory of sets or 



BASIC CONCEPTS OF DISCRETE PROBABILITY 




so-called “set algebra” must first be introduced. Then these concepts 
are used to introduce the fundamental definitions of the theory of proba- 
bility. Such a presentation allows a much wider application of the 
probability theory than does the older approach, which is inadequate for 
attacking a large class of modern problems. 

2-2. Sets. The word set, in mathematics, is used to denote any collection 
of objects specified according to a well-defined rule. Each object in a 
set is called an element, a member, or a point of the set. If x is an element 
of the set X, this relationship is expressed by 

x ∈ X        x belongs to X        (2-1) 

When x is not a member of the set X, this fact is shown by 

x ∉ X        x does not belong to X        (2-2) 

For example, if X is the set of all positive integers, then 

5 ∈ X 

√2 ∉ X 

-3 ∉ X 

A set can be specified by either giving all its elements or stating the 
requirements for the elements belonging to the set. If a, b, c, and d are 
the only members of a set X, then we may write either 

X = {a,b,c,d}        (2-3) 

or        X = {x}        (2-4) 

In the latter case x designates a general element of X with the 
understanding that the rule for identifying the members of X is known. For 
example, if the set X consists of the number of dots on the faces of a die, 
then we may write 

X = {1,2,3,4,5,6} 

If the set X consists of all rectangles with an area of 1 square foot we may 
write X = {x}, denoting by x any general rectangle having the specified 
area. 

When every element of a set A is a member of a set B, we say that A is a 
subset of B. This relationship is expressed by either of the forms 

A ⊂ B        A is contained in B        (2-5) 

or        B ⊃ A        A is a subset of B        (2-6) 

For example, if A is the set of positive integers and B the set of all rational 
numbers, then A is a subset of B. 
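
The membership and subset relations can be tried out on small finite sets. The following is a minimal Python sketch, not part of the original text. The names is_positive_integer, die_faces, and even_faces are hypothetical, and the infinite set of positive integers is represented by a membership test because it cannot be stored element by element.

```python
def is_positive_integer(value):
    """Membership test standing in for the set X of all positive integers."""
    return isinstance(value, int) and not isinstance(value, bool) and value > 0

die_faces = {1, 2, 3, 4, 5, 6}     # the set of dots on the faces of a die
even_faces = {2, 4, 6}             # hypothetical subset chosen for illustration

print(is_positive_integer(5))          # 5 belongs to X
print(is_positive_integer(-3))         # -3 does not belong to X
print(is_positive_integer(2 ** 0.5))   # the square root of 2 does not belong to X
print(even_faces <= die_faces)         # "is contained in" as a subset test
print(die_faces >= even_faces)         # the same relation read as "contains"
```

The operators <= and >= on Python sets play the role of the two forms of the subset relation.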





The sets A and B are said to be equal if they have exactly the same 
elements, that is, if 

A ⊂ B    and    A ⊃ B    then    A = B        (2-7) 

For instance, if the set A consists of the roots of the equation 

x(x + 1)(x² - 4)(x - 3) = 0 

and 

B = {-2,-1,0,2,3} 
C = {x}        x being any integer such that |x| < 4 

then 

C ⊃ A 
C ⊃ B 
A ⊃ B and A ⊂ B, that is, A = B 


In many instances, when dealing with specific problems, it is most 
convenient to confine the discussion to objects that belong to a fixed class of 
elements. This is referred to as a universal set. For example, suppose 
that, in a certain problem dealing with the study of numbers, it may be 
required to define the set of all integers I, or the set of positive numbers 
P, or the set of perfect square integers S. All these sets can be looked 
upon as subsets of the larger set of all real numbers. This latter set 
may be considered as the universal set U, a definition which is useful in 
dealing with the specific problem under discussion. 

In problems concerned with the 
interrelationship of sets, an illustra- 
tive diagram called a Venn* diagram 
is of considerable visual assistance. 
The elements of the universal set in 
a Venn diagram are generally shown 
by points in a rectangle. The elements of any set under consideration are 
commonly shown by a circle or by any other simple closed-contour inside 
the universal set. The universe associated with the aforesaid example is 
illustrated in Fig. 2-1. 

A set may contain a finite or an infinite number of elements. When a 
set has no element, it is said to be an empty or a null set. For example, 
the set of the real roots of the equation 

2z² + 1 = 0 

is a null set. 

Fig. 2-1. Example of a Venn diagram. 


* Named after the English logician John Venn (1834-1923). 






2-3. Operations on Sets. Consider a universal set U of any arbitrary 
elements. U contains all possible elements under consideration. The 
universal set may contain a number of subsets A, B, C, D, . . . which 
individually are well-defined sets. The operations of union, intersection, 
and complement are defined as follows: 

The union or sum of two sets A and B is the set of all those elements 
that belong to A or B or both. 

The intersection or product of two sets A and B is the set of all those 
elements that belong to both A and B. 

The difference B - A of any set A relative to the set B is a set 
consisting of all elements of B that are not elements of A. 

Fig. 2-4. Complement.        Fig. 2-5. Difference A - B. 

The complement or negation of any set A is the set A' containing all 
elements of the universe that are not elements of A. 

In the mathematical literature the following notations are commonly 
used in conjunction with the above definitions. 


A ∪ B        A union B, or A cup B                (2-8) 
A ∩ B        A intersection B, or A cap B         (2-9) 
A - B        relative complement of B in A        (2-10) 
B ⊂ A        B is contained in A 
~A           complement of A                      (2-11) 


In the engineering literature the notations given below are primarily 
used. 


A + B            sum or union                (2-12) 
A · B or AB      intersection or product     (2-13) 
A - B            difference                  (2-14) 
A'               complement                  (2-15) 







For the convenience of the engineer we shall generally adhere to the latter 
notations. However, where any confusion in notation may occur we shall 
resort to mathematical notation. 

The universe and the empty set will be denoted by U and ∅, respectively. 
When the product of two sets A and B is an empty set, that is, 

A ∩ B = ∅        (2-16) 

the two sets are said to be mutually exclusive. When the product of the 
two sets A and B is equal to B, then B is a subset of A. 

A ∩ B = B    implies    B ⊂ A        (2-17) 

The sum, the product, and the difference of two sets and the complement 
of any set A are illustrated in the shaded areas of the Venn diagrams 
of Figs. 2-2 to 2-5. Figures 2-6 and 2-7 illustrate the sets referring to 
Eqs. (2-16) and (2-17). 

Fig. 2-6. Mutually exclusive sets. AB = ∅.        Fig. 2-7. Subset B ⊂ A. AB = B. 

Example 2-1. Let the universe consist of the set of all positive integers, and let 

A = {1,2,3,6,7,10} 

B = {3,4,8,10} 

C = {z} 

where z is any positive integer larger than 5. 

Find A + B, A · B, A - B, A · C, B · C, C', and A + B + C. 

Solution 

A + B = {1,2,3,4,6,7,8,10} 

A · B = {3,10} 

A - B = {1,2,6,7} 

A · C = {6,7,10} 

B · C = {8,10} 

C' = {1,2,3,4,5} 

(A + B) + C = U - {5} 
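
The operations of Example 2-1 can be reproduced with a short Python sketch. It is an illustration only and rests on two assumptions: the universe is truncated to the integers 1 to LIMIT (an arbitrary bound, since the full set of positive integers cannot be enumerated), and C is taken as the integers larger than 5, the reading consistent with the listed products A · C and B · C.

```python
# Truncated stand-in for the universe of positive integers (assumption).
LIMIT = 20
U = set(range(1, LIMIT + 1))

A = {1, 2, 3, 6, 7, 10}
B = {3, 4, 8, 10}
C = {z for z in U if z > 5}          # integers larger than 5, up to LIMIT

union_ab = A | B                     # A + B
product_ab = A & B                   # A . B
difference_ab = A - B                # A - B
product_ac = A & C                   # A . C
product_bc = B & C                   # B . C
complement_c = U - C                 # C'
missing_from_union = U - (A | B | C) # elements of U not in A + B + C

for name, value in [("A + B", union_ab), ("A . B", product_ab),
                    ("A - B", difference_ab), ("A . C", product_ac),
                    ("B . C", product_bc), ("C'", complement_c),
                    ("U - (A + B + C)", missing_from_union)]:
    print(name, "=", sorted(value))
```

Only the element 5 is left outside A + B + C, whatever bound is chosen for LIMIT above 10.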

2-4. Algebra of Sets. We now state certain important properties con- 
cerning operations with sets. Let A, B, and C be subsets of a universal 
set U ; then the following laws hold. 







Fig. 2-8. Distributive law. A(B + C) = AB + AC.        Fig. 2-9. Distributive law. A + BC = (A + B)(A + C). 

Fig. 2-10. Dualization. (A + B)' = A'B'.        Fig. 2-11. Dualization. (AB)' = A' + B'. 


Commutative Laws: 

A + B = B + A        (2-18) 

AB = BA 

Associative Laws: 

(A + B) + C = A + (B + C)        (2-19) 

(AB)C = A(BC) 

Distributive Laws: 

A(B + C) = AB + AC        (2-20) 

A + BC = (A + B)(A + C) 

Complementarity: 

A + A' = U        (2-21) 

AA' = ∅ 

A + U = U        (2-22) 

AU = A 

A + ∅ = A        (2-23) 

A∅ = ∅ 

Difference Law: 

(AB) + (A - B) = A 

(AB)(A - B) = ∅        (2-24) 

A - B = AB' 








Dualization or De Morgan's Law: 

(A + B)' = A'B'        (2-25) 

(AB)' = A' + B' 

Involution Law: 

(A')' = A        (2-26) 

The complement of the set A' is the set A. 

Idempotent Law: For all sets A, 

A + A = A        (2-27) 

AA = A 
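
The laws of Eqs. (2-18) to (2-27) can be checked exhaustively on a small universe. The sketch below is an illustration in Python, not a proof: it assumes a three-element universe U = {1,2,3} and tests every triple of subsets. The helper names subsets, comp, and laws_hold are hypothetical.

```python
from itertools import combinations

U = frozenset({1, 2, 3})   # small universe chosen for illustration
EMPTY = frozenset()

def subsets(universe):
    """Return every subset of the universe as a frozenset."""
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def comp(s):
    """Complement relative to U."""
    return U - s

def laws_hold(A, B, C):
    """Check Eqs. (2-18) to (2-27); | is the sum, & is the product."""
    return all([
        A | B == B | A, A & B == B & A,                          # commutative
        (A | B) | C == A | (B | C), (A & B) & C == A & (B & C),  # associative
        A & (B | C) == (A & B) | (A & C),                        # distributive
        A | (B & C) == (A | B) & (A | C),
        A | comp(A) == U, A & comp(A) == EMPTY,                  # complementarity
        A | U == U, A & U == A, A | EMPTY == A, A & EMPTY == EMPTY,
        (A & B) | (A - B) == A, (A & B) & (A - B) == EMPTY,      # difference
        A - B == A & comp(B),
        comp(A | B) == comp(A) & comp(B),                        # De Morgan
        comp(A & B) == comp(A) | comp(B),
        comp(comp(A)) == A,                                      # involution
        A | A == A, A & A == A,                                  # idempotent
    ])

all_subsets = subsets(U)
all_laws_verified = all(laws_hold(A, B, C)
                        for A in all_subsets
                        for B in all_subsets
                        for C in all_subsets)
print(all_laws_verified)
```

An exhaustive check on one finite universe does not replace the element-wise argument, but it catches transcription slips in an identity quickly.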

While the afore-mentioned laws are not meant to offer an axiomatic 
presentation of set theory, they are of a fundamental nature for deriving 
a large variety of identities on sets. The agreement of all these laws with 
the laws of thought can be verified. One assumes that an element x is a 
member of the set of the left side of each identity, and then one has to 
prove that x will necessarily be a member of the set of the right side of the 
same equation. For instance, in order to prove the distributive law 
[Eq. (2-20)], let 

x ∈ A(B + C) 

Then 

x ∈ A 

x ∈ (B + C) 

Then at least one of the following three cases must be true: 

(a) x ∈ A        (b) x ∈ A        (c) x ∈ A 
    x ∈ B            x ∈ C            x ∈ B 
                                      x ∈ C 

These are in turn equivalent to 

(a) x ∈ AB        (b) x ∈ AC        (c) x ∈ ABC 

but        ABC ⊂ AB 

Therefore it is sufficient to require 

x ∈ AB + AC 

Similarly, one can show that x ∈ AB + AC implies x ∈ A(B + C). 

The Venn diagram is often a very useful visual aid. Its use is of valuable 
assistance in solving problems, as long as the formal proofs are not 
overlooked. 

Example 2-2. Verify the following relation: 

(A + B) - AB = AB' + A'B 

Solution. By virtue of the third relation of Eqs. (2-24), 

(A + B) - AB = (A + B)(AB)' 

Application of De Morgan's law yields 

(A + B)(AB)' = (A + B)(A' + B') 

(A + B)(A' + B') = A'A + A'B + B'A + B'B = AB' + A'B 

For an alternative proof, let 

x ∈ [(A + B) - AB] 

Then only one of the following two cases is possible: 

(a) x ∈ A        (b) x ∈ B 
    x ∉ B            x ∉ A 

These cases are equivalent to 

(a) x ∈ A and x ∈ B', that is, x ∈ AB' 

(b) x ∈ B and x ∈ A', that is, x ∈ A'B 

Note that AB' and A'B are mutually exclusive sets. Similarly, one can show that all 
the elements belonging to the set at the right side of the above equation also belong 
to the set of the left side. Thus the two sides present equivalent sets. 

Example 2-3. Express the set composed of the hatched region of Fig. E2-3 in 
terms of specified sets. 

Solution. The desired set A is 


A — A 1 A 2 ^12-^3 “h AiA^A^ 


See Fig. E2-3. 



Example 2-4. Verify the relation 

(A + B)'C = C - C(A + B) 

Solution. We may wish to verify the validity of this relation by using the Venn 
diagram of Fig. E2-4. The left side of this equation represents the part of the set C 
that is not in A or B. The right side represents C - CA - CB, that is, the part of C 
that is not included either in A or in B. 



Fig. E2-4 





Example 2-5. Consider the relay circuit of Fig. E2-5. The setup contains coils 
which must be activated for closing or opening the corresponding relay. A, B, and C 
are normally open relays and A', B', and C' are normally closed relays which are 
respectively activated by the same controlling source. For instance, when relay A 
is open because of the effect of its activating coil, A' is closed. In order to have a 
current flow between the terminals M and N, we must have the set of relay operations 
indicated by ABC + AB'C + A'B'C. With this in mind, the question is to replace 
the given network by a less complex equivalent circuit. 

Fig. E2-5 

Solution. A way of simplifying the above expression is the following: 

F = C(AB + AB' + A'B') 

F = C[A(B + B') + A'B'] 

F = C(AU + A'B') 

F = C(A + A'B') 

F = C(A + B') 

A circuit presentation of this example is illustrated in Fig. E2-5b. 
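
The simplification of the relay expression can be confirmed by a truth table. The following Python sketch is an illustration only; it treats each relay as a Boolean variable (True for a closed contact) and compares the expression ABC + AB'C + A'B'C with the reduced form C(A + B') over all eight combinations. The function names original and simplified are hypothetical.

```python
from itertools import product

def original(a, b, c):
    """ABC + AB'C + A'B'C, with + read as OR and juxtaposition as AND."""
    return (a and b and c) or (a and not b and c) or (not a and not b and c)

def simplified(a, b, c):
    """C(A + B'), the reduced form obtained in the solution."""
    return c and (a or not b)

rows = [(a, b, c, bool(original(a, b, c)), bool(simplified(a, b, c)))
        for a, b, c in product([False, True], repeat=3)]
equivalent = all(row[3] == row[4] for row in rows)

for row in rows:
    print(row)
print("equivalent:", equivalent)
```

A truth table with three variables has only eight rows, so this check is cheap enough to run on any candidate simplification.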

Example 2-6. Verify the equivalence of the two relay circuits of Fig. E2-6. 

Fig. E2-6 

Solution. The set that corresponds to the operation of the circuit in Fig. E2-6a is 

(A + B)(A' + B') 

Direct multiplication gives 

AA' + AB' + BA' + BB' = AB' + A'B 

The latter set can be immediately identified with the circuit of Fig. E2-6b. 

Sheffer-stroke Operation. Examples 2-5 and 2-6 have illustrated some 
use of Boolean algebra in relay circuits. As another example of the use of 
Boolean algebra in engineering problems, we discuss briefly what is 
referred to as the Sheffer-stroke operation. This operation for two sets X 
and Y is denoted by (X|Y) and is defined by the equation 

(X|Y) = X' ∪ Y'        not X, or not Y, or not X and Y 

The Sheffer stroke, commonly illustrated by the three-port diagram of 
Fig. 2-12, has the distinct property that it can replace all three basic 
operations of Boolean algebra (sum, product, and negation). The 
validity of this statement can be exhibited in a direct manner. 

Fig. 2-12. Sheffer stroke. 

Fig. 2-13. Product operation by two Sheffer strokes. 

Fig. 2-14. Summing operation by three Sheffer strokes. 

Fig. 2-15. Operation of negation with a Sheffer stroke. 

Product Operation. Reference to the diagram of Fig. 2-13 suggests 
that 

((X|Y)|(X|Y)) = (X' ∪ Y')' 

              = ((X ∩ Y)')' = X ∩ Y 

Summing Operation. The diagram of Fig. 2-14 suggests 

((X|X)|(Y|Y)) = (X|X)' ∪ (Y|Y)' 

              = X ∪ Y 

Negation. Reference is made to the diagram of Fig. 2-15. 

(X|X) = X' ∪ X' = X' 
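
The three constructions with the Sheffer stroke can be checked on sets directly. The sketch below, in Python, assumes a small universe U = {1,2,3,4} and a hypothetical helper stroke(x, y) implementing X' ∪ Y'. It verifies the product, sum, and negation identities for every pair of subsets.

```python
from itertools import combinations

U = frozenset({1, 2, 3, 4})   # small universe chosen for illustration

def stroke(x, y):
    """Sheffer stroke (X|Y) = X' U Y', complements taken relative to U."""
    return (U - x) | (U - y)

def all_subsets(universe):
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

pairs = [(x, y) for x in all_subsets(U) for y in all_subsets(U)]

product_ok = all(stroke(stroke(x, y), stroke(x, y)) == x & y for x, y in pairs)
sum_ok = all(stroke(stroke(x, x), stroke(y, y)) == x | y for x, y in pairs)
negation_ok = all(stroke(x, x) == U - x for x, _ in pairs)

print(product_ok, sum_ok, negation_ok)
```

In switching terms the stroke is the NAND operation, which is why a single gate type suffices to build the other three operations.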




2-5. Functions. In this section, some well-defined objects or numbers 
will be associated with each and every element of a given set. The rule 
on which this relationship is based is commonly known as a function. 

If X = {x} is a set and y = f(x) is a rule, that is, a sequence of specified 
operations and correspondence for assigning a well-defined object y to 
every member of X, then by applying this rule to the set X, we obtain a 
set Y = {y}. The set X is called the domain and Y the range. When x 
covers the elements of X, then y will correspondingly cover the elements 
of Y. For example, let X be the set of all persons living in the state of 
California on January 1, 1959, and let the function be defined as follows: 
anyone who is the father of a person described by X and is in the state of 
Colorado on January 1, 1959. Assuming that all the words appearing in 
the rule, such as father, California, Colorado, are well-defined words, this 
may be considered as a well-defined function. To each member of X 
there corresponds an object in the set Y. In this example, element zero 
in Y corresponds to some of the elements of X, and several members of X 
might share a single correspondent in Y. 

As another simple example, consider the set 

X = {1,2,0,-2,-1,½,10} 

and the function 

f(x) = x² - 1 

which lead to the set 

Y = {0,3,-1,3,0,-¾,99} 

The domain of x and the range of y are shown in Fig. 2-16, each element 
of X being assigned exactly one element of the Y set (distinct elements of X, 
such as 1 and -1, may share the same image). 
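
The mapping from X to Y under f(x) = x² - 1 can be tabulated with a few lines of Python. This is a sketch under one assumption: the sixth element of X is read as the fraction 1/2, which is represented exactly with fractions.Fraction. It also counts distinct images, which shows that several elements of X share a value in Y.

```python
from fractions import Fraction

# Domain X; the element 1/2 is an assumed reading of the printed set.
X = [1, 2, 0, -2, -1, Fraction(1, 2), 10]

def f(x):
    """The rule y = f(x) = x^2 - 1."""
    return x * x - 1

images = [f(x) for x in X]          # images listed in the order of X
Y = set(images)                     # the range, duplicates merged
is_one_to_one = len(Y) == len(X)    # True only if no two x share an image

for x, y in zip(X, images):
    print(x, "->", y)
print("distinct images:", len(Y), "of", len(X))
```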

Example 2-7. A set of ordered pairs s = {(x,y)}, that is, a set of points in the 
rectangular coordinate system, is given in Fig. E2-7a. 

Fig. E2-7 

(a) Describe the elements of the subset a = < x). 

(b) Describe the elements of the subset b = s — a. 

Solution 

(a) See Fig. E2-7b. 

(b) See Fig. E2-7c. 

Numerical Functions. Functions that have numerical values are the 
most common type. We can define the basic algebraic operations for a 
family of numerical functions defined over a specific domain X = {x}. 
For instance, if f1(x), f2(x), and f3(x) = const = k have a common domain, 

f1(x) + f2(x) 

k f1(x)        (2-28) 

f1(x) · f2(x) 

are also defined over the same domain. 





As a particularly interesting case of numerical function, consider the 
correspondence between the elements of a set having a finite number of 
elements and a set of positive integers. Such functions have the following 
basic property: If A and B are two disjoint sets having a number of a 
and b elements, respectively, then the number of elements of the set 
A + B is 

n(A ∪ B) = n(A) + n(B) = a + b        (2-29) 

where n(X) means the number of elements in the set X. The number of 
elements of a finite set has the simple but important property of being a 
real additive function. In other words, assume that A and B are themselves 
subsets of a set S containing a finite number of subsets A, B, C, 
D, . . . . Let f be a function that assigns a real number f(X) to each 
X ⊂ S, such that for any two disjoint subsets of S we have 

f(A ∪ B) = f(A) + f(B) 

Then f is called an additive set function. This result, of course, holds for 
the union of a finite number of disjoint subsets of S. 

Equivalent Sets. Let A and B be two sets. A rule that associates with 
each element a ∈ A exactly one element b ∈ B, and conversely, is said to 
be a one-to-one correspondence between A and B. Two sets A and B are 
equivalent if, and only if, a one-to-one correspondence between their 
elements can be established. 

As an example, consider the set of all persons (A) living in New York 
State and (B) living in the state of Arizona at a given time. Now if we 
associate each person of A with the cardinal numbers 1 to N, inclusive, 
and each person of B with the cardinal numbers 1 to M, inclusive, it is 
clear that there is a one-to-one correspondence between the elements of 
A + B and the set of cardinal numbers 1 to M + N, inclusive. 

The number of elements in a set may or may not be finite. In the 
latter case, if the elements of the set can be placed in a one-to-one corre- 
spondence with the set of natural numbers 

{1,2,3, . . .} (2-30) 

we say that the set has a denumerable or countable number of elements. 
For example, the number of elements in the set 

{1,4,9,16, . . . ,n², . . .} 

is denumerably infinite. 

A common example of nondenumerable sets can be given by considering 
points on a straight line. Let x denote the abscissa of a point of the 
line segment between points A and B with respective abscissa a and b. 
The inequality 

a < x < b 



indicates a set of points on the line AB that does not contain the end 
points A and B. Such a set is termed an open interval and is denoted by 

]a,b[        open interval a < x < b        (2-31) 

Similarly, a closed interval is defined and denoted as follows: 

[a,b]        closed interval a ≤ x ≤ b        (2-32) 

It can be shown that the set of points in [0,1] is nondenumerable.* 
If the set A is equivalent to the set of points in [0,1], it is said that A has 
the power of continuum.† 

The additive property of the function under consideration, i.e., the 
number of elements in finite sets, makes the following relations self-evident. 

n(A ∪ B) = n(A) + n(B) - n(AB)        (2-33) 

n(A - B) = n(A) - n(AB) 
n(A) + n(A') = n(U)        (2-34) 

For a set containing three subsets A, B, and C one can derive 

n(A ∪ B ∪ C) = n[A ∪ (B ∪ C)] 

n(A ∪ B ∪ C) = n(A) + n(B ∪ C) - n(AB ∪ AC)        (2-35) 

n(A ∪ B ∪ C) = n(A) + n(B) + n(C) - n(BC) - n(AB) 
               - n(AC) + n(ABC) 
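
The counting relations (2-33) to (2-35) can be checked numerically on concrete sets. The Python sketch below uses arbitrary example sets A, B, C and a universe U chosen only for illustration; n is implemented as the built-in len.

```python
# Arbitrary finite sets chosen for illustration (not from the text).
U = set(range(0, 40))
A = set(range(0, 12))
B = set(range(8, 20))
C = {1, 9, 15, 25, 30}

def n(s):
    """Number of elements of a finite set."""
    return len(s)

# Eq. (2-33): two overlapping sets.
two_set_lhs = n(A | B)
two_set_rhs = n(A) + n(B) - n(A & B)

# Eq. (2-34): a set and its complement exhaust the universe.
complement_sum = n(A) + n(U - A)

# Eq. (2-35): three sets in general.
three_set_lhs = n(A | B | C)
three_set_rhs = (n(A) + n(B) + n(C)
                 - n(B & C) - n(A & B) - n(A & C)
                 + n(A & B & C))

print(two_set_lhs, two_set_rhs)
print(complement_sum, n(U))
print(three_set_lhs, three_set_rhs)
```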

The following example is designed to employ the additive property of 
the afore-mentioned set functions. 

Example 2-8. There are three radio stations A, B, and C which can be received in 
a town of 3,000 families. The following data are given: 

(a) 1,800 families listen to station A. 

(b) 1,700 families listen to station B. 

(c) 1,200 families listen to station C. 

(d) 1,250 families listen to stations A and B. 

(e) 700 families listen to stations A and C. 

(f) 600 families listen to stations B and C. 

(g) 200 families listen to stations A, B, and C. 

* See I. P. Natanson, “Theory of Functions of a Real Variable” (translated from 
Russian), p. 21, Frederick Ungar Publishing Co., New York, 1955. 

† For a more complete mathematical treatment of probability, one has to examine 
in detail the operations on numerical functions associated with operations on denumer- 
able and nondenumerable sets. Such a detailed undertaking is avoided here for the 
sake of brevity lest the average reader find the text too elaborate. Nonetheless, for 
the sake of logical completeness, we shall try now and then to remind the reader of 
missing links. Here are some of the theorems that had to be omitted in this intro- 
ductory presentation. 

1. The sum of a finite number of disjoint denumerable sets is denumerable. 

2. The set of all rational numbers is denumerable. 

3. The sum of a denumerable number of disjoint sets each with the power of 
continuum has itself the power of continuum. See T. M. Apostol, “Mathematical 
Analysis,” pp. 31-33, Addison-Wesley Publishing Company, Reading, Mass., 1957. 





Of course any family may listen to other stations besides the ones specified in each 
case. The problem is to obtain the number of families who are not listening to any 
station. 

Solution. We draw the pertinent Venn diagram of Fig. E2-8 and, starting from the 
bottom of the above list, indicate the corresponding number of elements of each 
subset on the diagram. The number of families in set g is 200. Thus, the 
number of families listening to B and C but not to A is 

n(BCA') = n(BC) - n(BCA) = 600 - 200 = 400 

Following this procedure one can obtain all the numbers associated with each 
disjoint set in the Venn diagram. The total number of families listening to one 
or more stations is 2,350. This indicates that there are 650 families not 
listening to any of the above radio stations. 

Similar questions can easily be answered by referring to the Venn diagram of 
Fig. E2-8. For example, the number of families who are not listening to A but 
are listening either to B or to C or to both is 

n[A'(B ∪ C)] = n(A'B) + n(A'C) - n(A'BC) 
n[A'(B ∪ C)] = 450 + 500 - 400 = 550 
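
The arithmetic of Example 2-8 follows directly from Eq. (2-35) and can be reproduced in Python. The sketch below is illustrative; the variable names are hypothetical, and the number of families listening to both A and B is taken as 1,250, the value consistent with the totals 2,350 and 450 quoted in the solution.

```python
total_families = 3000
n_a, n_b, n_c = 1800, 1700, 1200       # single-station audiences
n_ab, n_ac, n_bc = 1250, 700, 600      # pairwise overlaps (n_ab assumed 1,250)
n_abc = 200                            # families listening to all three

# Eq. (2-35): families listening to at least one station.
n_union = n_a + n_b + n_c - n_ab - n_ac - n_bc + n_abc
n_none = total_families - n_union

# Families outside A that listen to B, to C, or to both.
n_b_not_a = n_b - n_ab                 # n(A'B)
n_c_not_a = n_c - n_ac                 # n(A'C)
n_bc_not_a = n_bc - n_abc              # n(A'BC)
n_not_a_b_or_c = n_b_not_a + n_c_not_a - n_bc_not_a

print(n_union, n_none, n_not_a_b_or_c)
```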

2-6. Sample Space. In this section we shall make preparations for 
applying the concept of set theory to probability. When talking about 
probability we usually have in mind what can be termed an experiment 
with certain outcomes. An outcome is any one of the possibilities that 
may be expected from the experiment. The totality of all these outcomes 
forms a universal set which is called the sample space. Each outcome is a 
point of the sample space. 

For example, the throw of an ordinary die may be considered as an 
experiment having six possible outcomes. With this experiment we 
associate a universal set containing six points, each corresponding to one 
of the outcomes of the experiment : 

{1,2,3,4,5,6} 

If the die is thrown twice, the sample space associated with the experi- 
ment contains 36 points corresponding to the following outcomes: 


11    12    13    14    15    16 
21    22    23    24    25    26 
31    32    33    34    35    36 
41    42    43    44    45    46 
51    52    53    54    55    56 
61    62    63    64    65    66 
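
The 36-point sample space for two throws can be generated rather than written out. The following Python sketch uses itertools.product; the event event_sum_seven is a hypothetical example of a subset of the sample space and does not appear in the text.

```python
from itertools import product

faces = range(1, 7)

# Sample space for two throws: all ordered pairs (first throw, second throw).
sample_space = list(product(faces, repeat=2))

# A hypothetical event: the two throws add up to 7.
event_sum_seven = {outcome for outcome in sample_space if sum(outcome) == 7}

# Print the outcomes in the same 6-by-6 layout as the listing.
for first in faces:
    print("  ".join(f"{first}{second}" for second in faces))
print(len(sample_space), len(event_sum_seven))
```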


A sample space may be finite or infinite, if it contains a finite or an 
infinite number of points, respectively. The sample space corresponding 
to a single throw of a die is finite. On the other hand, the sample space 
corresponding to an experiment of throwing the die until a 6 appears is an 
infinite space. It is possible to conceive a situation where one may have 
to throw the die infinitely many times without obtaining a 6. A sample 
space containing at most a denumerable number of elements is termed 
discrete. Sample spaces containing a nondenumerable number of elements 
include the so-called “continuous sample space.” In this case 
the range of the elements covers a continuum of values in contrast with 
the discrete set of values in the discrete sample space. 

A subset of a sample space is called an event. Thus, an event is a subset 
of a sample space containing any number of points or outcomes. 
(See Fig. 2-17.) 

Fig. 2-17. A probability space. 

An event containing no outcomes is a null set or an empty set and 
represents an event that is impossible. An event containing all sample 
points is an event that is certain to occur. This may be denoted by the 
universal set U, which means that the event under consideration is bound 
to occur. The occurrence of an event implies the occurrence of any one of 
its possible outcomes. The following glossary of terms may be of assistance 
in the transition from the language of set theory to that of probability 
theory: 

U                              All possibilities. 

A ⊂ U                          A particular event. 

A = U                          The event A must occur (certain). 

A = ∅                          The event A is impossible. 

A'                             The event A does not occur. 

x ∈ X or X ⊂ A                 x is any particular outcome of X. 
                               The occurrence of x implies the 
                               occurrence of A and X. 

y ∉ A                          y is not an outcome of the event A. 

ABC · · · D = S                S is the event of the simultaneous 
                               occurrence of events A, . . . , D. 

A + B + C + · · · + D = S      S is the event of the occurrence of 
                               A or B or C or · · · or D, or any 
                               combination of these. 

ABC · · · D = ∅                The events A, B, C, . . . , D are 
                               incompatible. 

A + B + C + · · · + D = U      At least one of the events A, B, 
                               C, . . . , D must occur. 







Example 2-9. A traveler has the choice of traveling by car, train, plane, or any 
combination of the three for a particular trip. Define the sample space and express 
some of the events of interest. 

Solution. Let C, T, and P correspond to the fact of traveling by car, train, or 
plane, respectively. The following events are self-explanatory. 

CTP        traveling by car, train, and plane 
CTP'       traveling by car and train but not by plane 
CT         traveling by car and train (with or without plane) 
C + T      traveling by car, by train, or by car and train (may or may not take the 
           plane) 
U - P      not traveling by plane 


Example 2-10. A traveler travels between cities M and N. The possible roads 
are shown in Fig. E2-10. Define the sample space and the events that the traveler 
goes through towns A, B, or both. 

Fig. E2-10 

Solution. Assuming that the traveler does not change direction while traveling, 
the following selection of roads is possible: 

15, 25, 146, 147, 246, 247, 36, 37, 345 

The sample space has nine points; i.e., our defined experiment may have nine distinct 
outcomes. The event of passing through the town A (event E1) consists of any of 
the three points 15, 146, and 147. The event of passing through the town B (event E2) 
consists of any of the three points 147, 247, and 37. Finally the event of passing 
through A and B (event E1E2) consists of a single point 147. Similarly, the following 
events can easily be identified: 

E1E2'        15, 146 

E1'E2        37, 247 

E1 + E2      15, 146, 147, 247, 37 

E1'E2'       25, 246, 36, 345 
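
The events of Example 2-10 are ordinary subsets of the nine-point sample space, so they can be computed with set operations. The Python sketch below is illustrative; road selections are stored as strings, and the variable names are hypothetical.

```python
# The nine possible road selections between M and N.
sample_space = {"15", "25", "146", "147", "246", "247", "36", "37", "345"}

E1 = {"15", "146", "147"}      # passing through town A
E2 = {"147", "247", "37"}      # passing through town B

both = E1 & E2                 # E1E2: through A and B
only_a = E1 - E2               # E1E2': through A but not B
only_b = E2 - E1               # E1'E2: through B but not A
either = E1 | E2               # E1 + E2: through A, B, or both
neither = sample_space - either  # E1'E2': through neither town

for name, event in [("E1E2", both), ("E1E2'", only_a), ("E1'E2", only_b),
                    ("E1 + E2", either), ("E1'E2'", neither)]:
    print(name, sorted(event))
```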


2-7. Probability Measure. In Sec. 2-5 on Functions we have asso- 
ciated arbitrary set functions with the elements of sets. In particular, we 
have outlined some numerical functions and observed certain rules such 
as the additivity relation of Eq. (2-33), when the set function was the 
number of elements in each set. The study of the mathematics of set 
functions has its place in a branch of mathematics known as measure 
theory. The probability measure is a specific type of function which can 
be associated with sets. When dealing with abstract mathematics, one 
may specify any arbitrary properties for the measure. However, the end 
result in this study is probability as applied to the physical world. It is 
mainly for this reason that we require our probability measure to fulfill 
the requirements that will be described below. These requirements are 
matters of convenience for our subsequent dealing with physical problems 
rather than a mathematical necessity. 





An experiment is defined so that to each possible outcome of this 
experiment there corresponds a point in the sample space. The number 
of outcomes of this experiment is assumed to be at most denumerable. 
The outcomes are labeled by symbols a_k, and a single-valued real function 
m{a_k} called the probability measure is defined. An event of interest A 
is considered as the set of the outcomes a_k giving rise to that event. The 
probability measure of an event is defined as the sum of the probability 
measures associated with all the outcomes a_k of that event. Two events 
A and B are termed disjoint if they contain no outcome in common. That 
is, two disjoint events cannot happen simultaneously. The probability 
measure has the following assumed properties: 

0 ≤ m{A_k} ≤ 1                                            (2-36)* 

m{A ∪ B} = m{A} + m{B}        if A and B are disjoint     (2-37) 

m{X} = 0        if X = ∅                                  (2-38) 

m{X} = 1        if X = U                                  (2-39) 

For a more general case involving a continuous sample space one 
employs the concept of integration. This is not considered here as it 
requires rigorous mathematical treatment beyond the present scope of 
interest. The interested reader is referred to “Probability Theory” by 
M. Loève (Chap. 1). 

The above measure-theory approach is certainly valid. Any measure 
satisfying the specified requirements, when applied to a problem involv- 
ing sets, will lead to a consistent mathematical setup. For example, if 
A, B, and C are subsets of a universal set U with an additive probability 
measure, that is, the measure associated with the union of two disjoint 
sets is equal to the sum of their individual measures, the following 
relations are valid: 

m{A} ≤ m{B}        if A ⊂ B                                 (2-40) 

m{A} = m{B} - m{B - A}        if A ⊂ B                      (2-40a) 

m{A'} = m{U - A} = m{U} - m{A} = 1 - m{A}                   (2-41) 

m{A ∪ B} = m{(A - AB) ∪ B} = m{A} - m{AB} + m{B}            (2-42) 

m{A} + m{B} ≥ m{AB}                                         (2-43) 

For three disjoint sets, 

m{A ∪ B ∪ C} = m{A} + m{B} + m{C}        (2-44) 


* It is certainly plausible to assign measures which have numerical values greater 
than one. While such measures can be consistently applied, they have no practical 
significance as far as probability is concerned in dealing with physical problems. 
Also note that the property stated in Eq. (2-38) can be derived from the other 
properties. 




For three sets in general, 

m{A ∪ B ∪ C} = m{A} + m{B} + m{C} 
               - m{AB} - m{BC} - m{CA} + m{ABC}        (2-45) 
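
A discrete probability measure of this kind can be written down as a table of outcome weights. The Python sketch below is an illustration under stated assumptions: the six outcomes and their unequal weights are hypothetical (chosen only so that they sum to 1), and the helper m returns the measure of an event as the sum of the weights of its outcomes. Exact fractions are used so that the equalities hold without rounding.

```python
from fractions import Fraction

# Hypothetical weights for six outcomes; they are nonnegative and sum to 1.
weight = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 8),
          4: Fraction(1, 16), 5: Fraction(1, 32), 6: Fraction(1, 32)}
U = set(weight)

def m(event):
    """Measure of an event: the sum of the weights of its outcomes."""
    return sum((weight[k] for k in event), Fraction(0))

A, B, C = {1, 2, 3}, {3, 4}, {4, 5, 6}

bounds_ok = all(0 <= m({k}) <= 1 for k in U)            # Eq. (2-36)
additive_ok = m({1, 2} | {5}) == m({1, 2}) + m({5})     # Eq. (2-37), disjoint sets
complement_ok = m(U - A) == 1 - m(A)                    # Eq. (2-41)
union_ok = m(A | B) == m(A) - m(A & B) + m(B)           # Eq. (2-42)
three_ok = m(A | B | C) == (m(A) + m(B) + m(C)
                            - m(A & B) - m(B & C) - m(C & A)
                            + m(A & B & C))             # Eq. (2-45)

print(m(set()), m(U))
print(bounds_ok, additive_ok, complement_ok, union_ok, three_ok)
```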

Example 2-11. Consider a set of all intervals I contained in the closed interval 
[0,1]. With each and every interval I we associate a measure function L(I) equal to 
the ordinary length of the same interval. See if such a measure satisfies the 
requirements of a probability measure. 

Solution. The requirement of (2-36) is satisfied, as the length associated with 
each member of the set is a nonnegative number between 0 and 1. The condition 
(2-37) is fulfilled by nonoverlapping intervals (mutually exclusive sets). The 
requirements (2-38) and (2-39) are also met. For a more thorough discussion the reader is 
referred to Cramér (Chap. 4, The Lebesgue Measure of a Linear Point Set). 

2-8. Frequency of Events.* In Sec. 2-7 on Probability Measure an 
introductory axiomatic account of probability as a measure of a set was 
given. The object of this section is to supplement the set-theory point 
of view with some perhaps less formal discussion of the probability of 
occurrence of certain events of a defined experiment. In other words, we 
wish to make a transition from the suggested abstract mathematical 
measure to some empirical numerical function fulfilling the specified 
measure requirements. 

The first step toward this objective is to define an experiment such as 
the tossing of a coin or the drawing of a card from a given deck of cards. 
Next, all the outcomes of this experiment must be specified. Now consider
a specific event $X_k$ among all the possible events of the experiment
under consideration. If the basic experiment is repeated $N$ times among
which the event $X_k$ has appeared $n(X_k)$ times, the ratio

$$\frac{n(X_k)}{N}$$

is defined as the relative frequency of the occurrence of the event $X_k$. In
case $N$ is increased indefinitely, intuitively speaking, the “limit” of

$$\frac{n(X_k)}{N} \tag{2-46}$$

as $N \to \infty$ is the probability $P\{X_k\}$ of the event $X_k$. This “definition”
of probability is more elaborate than the classical definition of Laplace 
which defines the probability as the ratio of the number of favorable 
events to the total number of possible events. In the latter definition 
all events are considered to be equally likely, that is, throwing of a true 

* This section has been inserted to accommodate those who feel more familiar with 
the old frequency concept. The intuitive concept of frequency requires a consider- 
able amount of clarification before it becomes mathematically acceptable. This can 
be done in the light of the laws of large numbers. The section may be omitted by 
those who prefer mathematical accuracy to physical justification. 




BASIC CONCEPTS OF DISCRETE PROBABILITY 

die by an honest person under prescribed circumstances. It is to be
noted that

$$0 \le n(X_k) \le N \tag{2-47}$$

$$0 \le \frac{n(X_k)}{N} \le 1 \tag{2-48}$$

$$0 \le \lim_{N \to \infty} \frac{n(X_k)}{N} \le 1 \tag{2-49}$$

Equation (2-49) states that the probability of any event $X_k$ is a real
number in the real interval [0,1].

$$0 \le P\{X_k\} \le 1 \tag{2-50}$$

Considering an event that occurs in every observation yields the limiting
case $P\{X_k\} = 1$, which is a certain event. Also, an event that never occurs
will lead to the other limiting case, $P\{X_k\} = 0$, which is an impossible
event.

We have thus far shown that this empirical definition of probability
satisfies the requirements (2-36), (2-38), and (2-39). It remains to be
seen whether the requirement (2-37) holds or not. In order to verify this,
consider two particular events A and B among the events that result from
the experiment. Let the experiment be repeated n times. Each observation
can belong to only one of the four following categories (Fig. 2-18):

1. A has occurred but not B, the event AB'.

2. B has occurred but not A, the event BA'.

3. Both A and B have occurred, the event AB.

4. Neither A nor B has occurred, the event A'B'.

Note that

$$A = AB' \cup AB$$
$$B = BA' \cup AB$$
$$A \cup B = AB' \cup AB \cup BA'$$



Fig. 2-18. Probability space of two 
events. 


If the number of events of each category is denoted by $n_1$, $n_2$, $n_3$, and $n_4$,
respectively, the following equations are self-explanatory:

$$n_1 + n_2 + n_3 + n_4 = n \tag{2-51}$$

f{A}, relative frequency of A independent of B:

$$f\{A\} = \frac{n_1 + n_3}{n} \tag{2-52}$$

f{B}, relative frequency of B independent of A:

$$f\{B\} = \frac{n_2 + n_3}{n} \tag{2-53}$$

f{A + B}, relative frequency of either A, B, or both:

$$f\{A + B\} = \frac{n_1 + n_2 + n_3}{n} \tag{2-54}$$

f{AB}, relative frequency of A and B occurring together:

$$f\{AB\} = \frac{n_3}{n} \tag{2-55}$$

f{A|B}, relative frequency of A under condition that B has occurred:

$$f\{A|B\} = \frac{n_3}{n_2 + n_3} \tag{2-56}$$

f{B|A}, relative frequency of B under condition that A has occurred:

$$f\{B|A\} = \frac{n_3}{n_1 + n_3} \tag{2-57}$$

When the number of experiments tends to infinity, these simple relations
with proper interpretation lead to the addition law and multiplication
law:

$$P\{A \cup B\} = P\{A\} + P\{B\} - P\{AB\} \tag{2-58}$$

$$P\{AB\} = P\{A\}P\{B|A\} \tag{2-59}$$

$$P\{AB\} = P\{B\}P\{A|B\} \tag{2-60}$$

For the special case of mutually exclusive events, $P\{AB\} = 0$,

$$P\{A + B\} = P\{A\} + P\{B\} \tag{2-61}$$

Equation (2-61) shows the validity of the requirement (2-37) for the
chosen set function termed the relative frequency of the event.

More specifically, we have proved that the probability measure defined
by Eq. (2-46) satisfies the following basic properties for all sets defined in
sample space:

$$0 \le P\{A\} \le 1 \tag{2-62}$$

$$P\{A \cup B\} = P\{A\} + P\{B\} \quad \text{for mutually exclusive } A \text{ and } B \tag{2-63}$$

$$P\{x\} = 0 \quad \text{if, and only if, } x = \varnothing \tag{2-64}$$

$$P\{x\} = 1 \quad \text{if, and only if, } x = U \tag{2-65}$$


Therefore the suggested definition of the frequency can serve as a proba- 
bility measure. The implication of Eqs. (2-58) to (2-60) will be investi- 
gated in subsequent sections. 
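
The counting argument of this section can be checked on a finite record of observations. The following is a minimal sketch, not part of the original text: the function names and the list of die throws are hypothetical, and exact fractions are used so that the addition and multiplication relations can be compared without rounding.

```python
from fractions import Fraction


def category_counts(observations):
    """Count the four categories of Sec. 2-8 from (a_occurred, b_occurred) pairs."""
    n1 = sum(1 for a, b in observations if a and not b)      # event AB'
    n2 = sum(1 for a, b in observations if b and not a)      # event BA'
    n3 = sum(1 for a, b in observations if a and b)          # event AB
    n4 = sum(1 for a, b in observations if not a and not b)  # event A'B'
    return n1, n2, n3, n4


def relative_frequencies(observations):
    """Relative frequencies of Eqs. (2-52) to (2-57).

    Assumes A and B each occur at least once, so no denominator is zero.
    """
    n1, n2, n3, n4 = category_counts(observations)
    n = n1 + n2 + n3 + n4
    return {
        "A": Fraction(n1 + n3, n),
        "B": Fraction(n2 + n3, n),
        "A+B": Fraction(n1 + n2 + n3, n),
        "AB": Fraction(n3, n),
        "A|B": Fraction(n3, n2 + n3),
        "B|A": Fraction(n3, n1 + n3),
    }


# Hypothetical record of 12 throws of a die; A = "even", B = "greater than 3".
throws = [1, 2, 3, 4, 5, 6, 6, 4, 2, 5, 3, 1]
f = relative_frequencies([(t % 2 == 0, t > 3) for t in throws])
print(f["A+B"], f["A"] + f["B"] - f["AB"])
```

For any record the two printed values agree, because both reduce to $(n_1 + n_2 + n_3)/n$; the same holds for $f\{AB\} = f\{A\}f\{B|A\}$.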

The frequency approach is a rather common approach for defining the 
probability when dealing with physical problems. Its mathematical 
concept relies on the tacit assumption of an equiprobable measure, that is, 
the equal likelihood of the outcome of the repeated experiments. We 
assume that the measure associated with an event, in the case of the 
repeated experiment, is proportional to the number of the outcomes in the 
event under consideration. In essence, this assumption makes the fre- 
quency definition somewhat too restrictive. 

2-9. Theorem of Addition. It seems appropriate now to continue with
our formalism without restriction to an immediately practical but slow
procedure. For two events A and B of the sample space one has

$$A \cup (B - AB) = A \cup B \tag{2-66}$$

The additive property of the probability measure in Sec. 2-7 suggests that

$$m\{A + B\} = m\{A\} + m\{B\} - m\{AB\}$$

$$P\{A + B\} = P\{A\} + P\{B\} - P\{AB\} \le P\{A\} + P\{B\} \tag{2-67}$$

If two events A and B are mutually exclusive, then

$$P\{AB\} = P\{\varnothing\} = 0 \tag{2-68}$$

$$P\{A + B\} = P\{A\} + P\{B\} \tag{2-69}$$

For two opposite events A and A', one has $A + A' = U$, and since
$AA' = \varnothing$, then

$$P\{A + A'\} = P\{A\} + P\{A'\} = P\{U\} = 1 \tag{2-70}$$

$$P\{A'\} = 1 - P\{A\}$$

For the three events A, B, and C, we may write

$$P\{A \cup B \cup C\} = P\{A \cup B\} + P\{C\} - P\{(A \cup B)C\} \tag{2-71}$$

$$P\{A \cup B \cup C\} = P\{A\} + P\{B\} + P\{C\} - P\{AB\} - P\{BC\} - P\{CA\} + P\{ABC\} \tag{2-72}$$

This is indeed made clear by employing a pertinent Venn's diagram,
$P\{ABC\}$ being the probability of the simultaneous occurrence of the
three events. If the events are mutually exclusive, then

$$P\{A \cup B \cup C\} = P\{A\} + P\{B\} + P\{C\} \tag{2-73}$$

More generally, for a number of events $A_1, A_2, \ldots, A_n$ one may write

$$\begin{aligned}
P\{A_1 \cup A_2 \cup \cdots \cup A_n\} ={}& P\{A_1\} + P\{A_2\} + \cdots + P\{A_n\} \\
&- P\{A_1A_2\} - P\{A_1A_3\} - \cdots - P\{A_{n-1}A_n\} \\
&+ P\{A_1A_2A_3\} + \cdots + P\{A_{n-2}A_{n-1}A_n\} + \cdots \\
&+ (-1)^{n-1} P\{A_1A_2 \cdots A_n\}
\end{aligned} \tag{2-74}$$

By extension of the relation in Eq. (2-66), it can be shown without
difficulty that

$$P\{A_1 \cup A_2 \cup \cdots \cup A_n\} \le P\{A_1\} + P\{A_2\} + \cdots + P\{A_n\} \tag{2-75}$$

The equality sign holds when the events $A_k$ and $A_j$ are mutually exclusive
for all $k \ne j$.

Example 2-12. An urn contains 11 balls numbered from 1 to 11. If a ball is
selected at random, what is the probability of having a ball with a number which is a
multiple of either 2 or 3?

Solution. Let A and B be the events that the ball number is a multiple of 2 and 3,
respectively. The event of interest is A + B.

$$P\{A\} = \tfrac{5}{11} \qquad P\{B\} = \tfrac{3}{11} \qquad P\{AB\} = \tfrac{1}{11}$$

$$P\{A + B\} = \tfrac{5}{11} + \tfrac{3}{11} - \tfrac{1}{11} = \tfrac{7}{11}$$
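
The general addition formula (2-74) can be evaluated mechanically when the outcomes are equally likely. The sketch below is an illustration under that assumption; the function name `union_probability` is hypothetical, and the urn of Example 2-12 is used as the check case.

```python
from fractions import Fraction
from itertools import combinations


def union_probability(events, sample_space):
    """P{A1 U ... U An} by inclusion-exclusion, assuming equally likely outcomes."""
    outcomes = set(sample_space)
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = (-1) ** (r - 1)  # +, -, +, ... as in Eq. (2-74)
        for group in combinations(events, r):
            common = outcomes.intersection(*group)  # simultaneous occurrence
            total += sign * Fraction(len(common), len(outcomes))
    return total


balls = range(1, 12)
multiples_of_2 = {b for b in balls if b % 2 == 0}
multiples_of_3 = {b for b in balls if b % 3 == 0}
p_union = union_probability([multiples_of_2, multiples_of_3], balls)
print(p_union)
```

The result can be compared with a direct count of the balls lying in either set, which is the quantity the formula is meant to reproduce.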


Example 2-13. One card is drawn from a regular deck of 52 cards. What is
the probability of the card being either red or a king?

Solution. Let A be the event that the card is red, and B the event that the card
is a king. The event of interest is A + B. Since A and B are not exclusive events,
apply Eq. (2-67):

$$P\{A\} = \tfrac{1}{2} \qquad P\{B\} = \tfrac{1}{13}$$

$$P\{AB\} = \left(\tfrac{1}{13}\right)\left(\tfrac{1}{2}\right) = \tfrac{1}{26}$$

$$P\{A + B\} = \tfrac{1}{2} + \tfrac{1}{13} - \tfrac{1}{26} = \tfrac{7}{13}$$


Example 2-14. An honest coin is tossed 10 times. What is the probability of
having at least (a) one tail and (b) two tails?

Solution. The main assumption in this and in similar problems is the concept of
independence of successive trials and the equally probable outcomes.

Let A and B be the events of getting no tail and exactly one tail, respectively.
Then

$$P\{A\} = \frac{1}{1{,}024} \qquad P\{B\} = \frac{10}{1{,}024}$$

The events of interest are

(a) $A'$, with

$$P\{A'\} = 1 - \frac{1}{1{,}024} = \frac{1{,}023}{1{,}024}$$

(b) $U - (A + B) = (U - A) - B = A' - B$, with

$$P\{A' - B\} = \frac{1{,}023}{1{,}024} - \frac{10}{1{,}024} = \frac{1{,}013}{1{,}024}$$
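
Because all $2^{10} = 1{,}024$ sequences of heads and tails are taken as equally likely, the two answers can also be obtained by direct enumeration. This is a small sketch under that assumption; the helper name `prob` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Every sequence of 10 tosses of an honest coin, each assumed equally likely.
sequences = list(product("HT", repeat=10))


def prob(predicate):
    """Ratio of favorable sequences to all sequences."""
    favorable = sum(1 for s in sequences if predicate(s))
    return Fraction(favorable, len(sequences))


p_at_least_one_tail = prob(lambda s: s.count("T") >= 1)
p_at_least_two_tails = prob(lambda s: s.count("T") >= 2)
print(p_at_least_one_tail, p_at_least_two_tails)
```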


2-10. Conditional Probability. Consider two events A and B. The 
conditional probability of event A based on the hypothesis that event B 
has occurred is defined by the following relation : 


$$P\{A|B\} = \frac{P\{AB\}}{P\{B\}} \qquad P\{B\} \ne 0 \tag{2-76}$$

The use of this definition can be justified by returning to the previously
treated case of Sec. 2-8. The frequency of the occurrence of
event A under the assumption that B has occurred is

$$f\{A|B\} = \frac{n_3}{n_2 + n_3} = \frac{f\{AB\}}{f\{B\}} \tag{2-77}$$

By the same token, the frequency of the occurrence of B, knowing that A
has already occurred, is

$$f\{B|A\} = \frac{n_3}{n_1 + n_3} = \frac{f\{AB\}}{f\{A\}} \tag{2-78}$$

Increasing the number of trials indefinitely gives

$$P\{A|B\} = \frac{P\{AB\}}{P\{B\}} \qquad P\{B\} \ne 0 \tag{2-79}$$

$$P\{B|A\} = \frac{P\{AB\}}{P\{A\}} \qquad P\{A\} \ne 0 \tag{2-80}$$

The two events A and B are said to be mutually independent if

$$P\{A|B\} = P\{A\} \qquad P\{B|A\} = P\{B\} \tag{2-81}$$

Note that for mutually independent events

$$P\{AB\} = P\{A\} \cdot P\{B\} \tag{2-82}$$


Equations (2-81) and (2-82) are alternatively used as the defining rela- 
tions for two mutually independent events.* 

Example 2-15. Three boxes of identical appearance contain two coins each. In
one box both are gold, in one box both silver, and in the third box one is a silver coin
and the other is a gold coin. Suppose that a box is selected at random and, further,
that a coin in that box is selected at random. If this coin proves to be gold, what is
the probability that the other coin in the box is also gold?

Solution. Let

$A_{gg}$ be the event that the other coin in the selected box is also gold (that is, the
selected box is the gg box)

$B_g$ be the event that the first coin in the selected box is a gold coin

The desired probability is

$$P\{A_{gg}|B_g\} = \frac{P\{B_g|A_{gg}\}P\{A_{gg}\}}{P\{B_g\}}$$

$$P\{B_g\} = \tfrac{1}{2} \qquad P\{A_{gg}\} = \tfrac{1}{3} \qquad P\{B_g|A_{gg}\} = 1$$

$$P\{A_{gg}|B_g\} = \frac{1 \cdot \tfrac{1}{3}}{\tfrac{1}{2}} = \tfrac{2}{3}$$
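
The conditional probability of this example can be checked by listing the six equally likely (box, coin) selections and applying definition (2-76) as a ratio of counts. The sketch below assumes that listing; the variable names are hypothetical.

```python
from fractions import Fraction

# Each box holds two coins: "g" for gold, "s" for silver.
boxes = [("g", "g"), ("s", "s"), ("g", "s")]

# Each outcome is (coin drawn, other coin in the same box); the six
# (box, position) choices are assumed equally likely.
outcomes = [(box[i], box[1 - i]) for box in boxes for i in (0, 1)]

first_gold = [o for o in outcomes if o[0] == "g"]       # event B_g
both_gold = [o for o in first_gold if o[1] == "g"]      # event A_gg B_g

p_other_gold_given_gold = Fraction(len(both_gold), len(first_gold))
print(p_other_gold_given_gold)
```

Counting makes the source of the answer visible: three of the six selections show a gold coin, and two of those three come from the box with two gold coins.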

* For two independent events, the defining equation (2-81) holds, but when
$P\{A\} = 0$ or $P\{B\} = 0$, then $P\{B|A\}$ or $P\{A|B\}$ is not defined. For this reason some
authors prefer to define mutual independence in such a way that Eq. (2-81) remains
valid for all circumstances, including

$$P\{A\} = 0 \qquad P\{A\} = 1$$
$$P\{B\} = 0 \qquad P\{B\} = 1$$

For this purpose, the following defining equations are suggested (Fortet, p. 85):

$$P\{AB\} = P\{A\}P\{B\} \qquad P\{A'B\} = P\{A'\}P\{B\}$$
$$P\{AB'\} = P\{A\}P\{B'\} \qquad P\{A'B'\} = P\{A'\}P\{B'\}$$

Independent events are more specifically called statistically independent or
stochastically independent.





Example 2-16. In a certain group of engineers, 60 per cent have insufficient
background of information theory, 50 per cent have inadequate knowledge of probability,
and 80 per cent are in either one or both of the two categories. What is the
percentage of people who know probability among those who have a sufficient background
of information theory?

Solution. Let

A be those having insufficient background of information theory

B be those having inadequate knowledge of probability

Then

$$P\{A\} = 0.60 \qquad P\{B\} = 0.50 \qquad P\{A + B\} = 0.80$$

$$P\{A'\} = 0.40 \qquad P\{B'\} = 0.50$$

$$P\{(A + B)'\} = P\{A'B'\} = 0.20$$

It is required to find

$$P\{B'|A'\} = \frac{P\{A'B'\}}{P\{A'\}} = \frac{0.20}{0.40} = 50 \text{ per cent}$$


2-11. Theorem of Multiplication. The multiplication rule for the 
case of two events A and B can be obtained through the definition of 
the conditional probability. 


$$P\{AB\} = P\{A\}P\{B|A\}$$

$$P\{AB\} = P\{B\}P\{A|B\} \tag{2-83}$$

This rule can be extended to the case of more than two events. For
instance, for three events A, B, and C, one writes

$$P\{ABC\} = P\{AB\}P\{C|AB\} = P\{A\}P\{B|A\}P\{C|AB\} \tag{2-84}$$

More generally,

$$P\{A_1, A_2, \ldots, A_n\} = P\{A_1\}P\{A_2|A_1\}P\{A_3|A_1A_2\} \cdots P\{A_n|A_1A_2 \cdots A_{n-1}\} \tag{2-85}$$

When a finite number or a countably infinite number of events $A_1$,
$A_2, \ldots, A_n$ are mutually independent,* we have

$$P\{A_1, A_2, \ldots, A_n\} = P\{A_1\}P\{A_2\} \cdots P\{A_n\} \tag{2-86}$$


* The events $A_1, A_2, \ldots, A_n$ are said to be statistically independent of each other
when the probability of any of these events and the probability associated with the
intersection of any number of these events do not depend on any other event except
those occurring in the intersection. That is,

$$P\{A_iA_j\} = P\{A_i\}P\{A_j\}$$
$$P\{A_iA_jA_k\} = P\{A_i\}P\{A_j\}P\{A_k\}$$
$$\cdots$$
$$P\{A_1A_2 \cdots A_n\} = P\{A_1\}P\{A_2\} \cdots P\{A_n\}$$

for all combinations of i, j, and k satisfying

$$1 \le i < j < k < \cdots \le n$$





Example 2-17. In a small library there are 1,000 books, among which 500 are
scientific. Among the scientific books are 100 which are devoted to engineering
subjects. Three books are chosen at random, the chosen book being replaced each
time. What is the probability of getting

(a) All three scientific books

(b) Three scientific books among which only one is an engineering book

(c) At least one of the three an engineering book

Solution. Let S and E stand for the event of selecting a scientific and an
engineering book, respectively. The events of interest discussed in the problem are

(a) $S_1S_2S_3$

(b) $(S_1E)(S_2E')(S_3E') + (S_1E')(S_2E)(S_3E') + (S_1E')(S_2E')(S_3E)$

(c) $U - E_1'E_2'E_3'$

(a) $P\{S_1S_2S_3\} = P\{S_1\}P\{S_2\}P\{S_3\} = \left(\tfrac{1}{2}\right)^3 = \tfrac{1}{8}$

(b) $P\{SE\} = P\{E|S\}P\{S\} = \tfrac{1}{5} \cdot \tfrac{1}{2} = \tfrac{1}{10}$

$P\{SE'\} = P\{E'|S\} \cdot P\{S\} = \tfrac{4}{5} \cdot \tfrac{1}{2} = \tfrac{4}{10}$

$3P\{(S_1E)(S_2E')(S_3E')\} = 3 \cdot \tfrac{1}{10} \cdot \tfrac{4}{10} \cdot \tfrac{4}{10} = 0.048$

(c) $P\{U - E_1'E_2'E_3'\} = 1 - P\{E_1'E_2'E_3'\} = 1 - \left(\tfrac{9}{10}\right)^3 = 0.271$
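
The three answers follow from the multiplication rule once the single-draw probabilities are written down. The sketch below repeats that arithmetic with exact fractions; the variable names are hypothetical, and independence of the three draws (replacement after each draw) is assumed, as in the example.

```python
from fractions import Fraction

p_s = Fraction(500, 1000)          # P{S}: a scientific book
p_e_given_s = Fraction(100, 500)   # P{E|S}: engineering, given scientific
p_e = Fraction(100, 1000)          # P{E}: an engineering book

p_se = p_s * p_e_given_s               # scientific and engineering
p_se_not = p_s * (1 - p_e_given_s)     # scientific but not engineering

# (a) all three scientific; draws are independent because of replacement.
p_a = p_s ** 3
# (b) three scientific books, exactly one of them engineering (3 orderings).
p_b = 3 * p_se * p_se_not ** 2
# (c) at least one engineering book = complement of "none is engineering".
p_c = 1 - (1 - p_e) ** 3

print(p_a, float(p_b), float(p_c))
```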

Example 2-18. Four persons write their names on individual slips of paper and
deposit the slips in a common box. Each of the four draws at random a slip from the
box. Determine the probability of each person retrieving his own name slip.

Solution. Let $E_k$ be the event that the kth person retrieves his own name slip.
The event of interest is $E_1E_2E_3E_4$. Equation (2-85) yields

$$P\{E_1E_2E_3E_4\} = P\{E_1|E_2E_3E_4\} \cdot P\{E_2|E_3E_4\} \cdot P\{E_3|E_4\} \cdot P\{E_4\}$$

$$P\{E_1E_2E_3E_4\} = 1 \cdot \tfrac{1}{2} \cdot \tfrac{1}{3} \cdot \tfrac{1}{4} = \tfrac{1}{24}$$
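
The same result can be confirmed by listing every way the slips can be handed out. This is a sketch assuming that each of the $4!$ assignments of slips to persons is equally likely; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def prob_all_retrieve_own(n_people):
    """Probability that every person draws his own slip, by enumeration."""
    draws = list(permutations(range(n_people)))  # draws[i] = slip of person i
    favorable = sum(
        1 for d in draws if all(d[i] == i for i in range(n_people))
    )
    return Fraction(favorable, len(draws))


p_four = prob_all_retrieve_own(4)
print(p_four)
```

Only the identity assignment is favorable, so the probability is one over the number of permutations.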

Example 2-19. The probability of the closing of each relay of the circuit of Fig. 
E2-19 is a given a. Assuming that all relays act independently, what is the proba- 
bility of a current existing between terminals A and B? 



Fig. E2-19 


Solution. Let the event of closing each relay 1, 2, 3, and 4 be $E_1$, $E_2$, $E_3$, and $E_4$,
respectively. The four events are independent but not necessarily mutually exclusive.
The event of interest is

$$E = E_1E_2 + E_3E_4$$

$$P\{E\} = P\{E_1E_2 + E_3E_4\} = P\{E_1E_2\} + P\{E_3E_4\} - P\{E_1E_2E_3E_4\}$$

$$P\{E\} = P\{E_1\}P\{E_2\} + P\{E_3\}P\{E_4\} - P\{E_1\}P\{E_2\}P\{E_3\}P\{E_4\}$$

$$P\{E\} = 2a^2 - a^4$$

Note that

$$P\{E\} = 0 \quad \text{for } a = 0$$
$$P\{E\} = 1 \quad \text{for } a = 1$$
$$0 < P\{E\} < 1 \quad \text{for } 0 < a < 1$$
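
The expression $2a^2 - a^4$ can be checked by summing the probabilities of all 16 open/closed configurations in which a path exists. The sketch below assumes the circuit implied by the event $E = E_1E_2 + E_3E_4$ (relays 1 and 2 in one branch, relays 3 and 4 in the other); the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def prob_current(a):
    """P{E} for E = E1E2 + E3E4 with independent relays, each closed w.p. a."""
    total = Fraction(0)
    for e1, e2, e3, e4 in product((True, False), repeat=4):
        weight = Fraction(1)
        for closed in (e1, e2, e3, e4):
            weight *= a if closed else 1 - a   # independence of the relays
        if (e1 and e2) or (e3 and e4):         # a closed path from A to B
            total += weight
    return total


a = Fraction(1, 3)   # arbitrary test value of the closing probability
print(prob_current(a), 2 * a**2 - a**4)
```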

2-12. Bayes’s Theorem. In many problems we wish to concentrate 
on two mutually exclusive and exhaustive events of the sample space, 
that is, two events $A_1$ and $A_2$ such that

$$A_1A_2 = \varnothing \qquad A_1 + A_2 = U \tag{2-87}$$

The assumption is that each of these events has a subevent of special
interest to us. If the subevents are indicated by $EA_1$ and $EA_2$, then the
event of interest $E = EA_1 + EA_2$ can occur only when $A_1$ or $A_2$ occurs.
The conditional probabilities $P\{E|A_1\}$ and $P\{E|A_2\}$ are assumed to be
known; we are also given the information that E has occurred. The
problem is to determine how likely it is that E has occurred because of the
occurrence of either of the two events $A_1$ and $A_2$. In mathematical
notation, given

$$P\{A_1\} = \omega_1 \qquad P\{A_2\} = \omega_2 \tag{2-88}$$

$$A_1 + A_2 = U \qquad A_1A_2 = \varnothing$$

$$P\{E|A_1\} = p_1 \qquad P\{E|A_2\} = p_2$$

find $P\{A_1|E\}$ and $P\{A_2|E\}$.

The computation can be done in a direct way by applying the rule of 
addition and multiplication. Note that 

$$E = A_1E \cup A_2E \tag{2-89}$$

As $A_1E$ and $A_2E$ are mutually exclusive events, we may write

$$P\{E\} = P\{A_1E\} + P\{A_2E\}$$

These probabilities can be calculated as follows:

$$P\{A_1E\} = P\{A_1\}P\{E|A_1\}$$

$$P\{A_2E\} = P\{A_2\}P\{E|A_2\}$$

Therefore,

$$P\{E\} = P\{A_1\}P\{E|A_1\} + P\{A_2\}P\{E|A_2\} \tag{2-90}$$

$$P\{A_1|E\} = \frac{P\{A_1E\}}{P\{E\}} = \frac{P\{A_1\}P\{E|A_1\}}{P\{A_1\}P\{E|A_1\} + P\{A_2\}P\{E|A_2\}}$$

$$P\{A_2|E\} = \frac{P\{A_2E\}}{P\{E\}} = \frac{P\{A_2\}P\{E|A_2\}}{P\{A_1\}P\{E|A_1\} + P\{A_2\}P\{E|A_2\}} \tag{2-91}$$


Finally one finds

$$P\{A_1|E\} = \frac{\omega_1 p_1}{\omega_1 p_1 + \omega_2 p_2} \qquad P\{A_2|E\} = \frac{\omega_2 p_2}{\omega_1 p_1 + \omega_2 p_2} \tag{2-92}$$

The probabilities expressed in Eqs. (2-92) are called the a posteriori
probabilities of $A_1$ and $A_2$, given E. The probabilities $\omega_1 p_1$ and $\omega_2 p_2$
are termed the a priori probabilities of E, given $A_1$ and $A_2$. Equations
(2-92) provide a means for calculating the a posteriori probabilities from
the a priori probabilities. Equations (2-92) are known as Bayes's rule.
It is of interest to note that Bayes's rule applies to a partitioned sample
space, as shown in Fig. 2-19. The events $A_1$ and $A_2$ may each consist of
sets containing a number of subevents. Electrical engineers may note
that Bayes's rule is somewhat similar to Thévenin's theorem in network
theory. Thévenin's theorem permits a partitioning of the network into
two parts and a study of the system with respect to one pair of terminals
of the partitioned boundary.

Fig. 2-19. (a) Thévenin's partitioning. $A_1$, a part of the network; $A_2$, the remainder
network. (b) Bayes's partitioning.

Bayes's theorem, like Thévenin's theorem, can be extended to a
partitioning of the sample space into mutually exclusive and exhaustive parts.
Suppose that an event E can occur as a result of the occurrence of several
mutually exclusive and exhaustive events $A_1, A_2, \ldots, A_n$. Let the
corresponding conditional probabilities be given as

$$P\{E|A_k\} = p_k \qquad k = 1, \ldots, n$$

and let

$$P\{A_k\} = \omega_k \tag{2-93}$$

Then, by the law of addition, we have

$$E = A_1E + A_2E + \cdots + A_nE \tag{2-94}$$

$$P\{E\} = P\{A_1E + A_2E + \cdots + A_nE\} \tag{2-95}$$

$$P\{E\} = \omega_1 p_1 + \omega_2 p_2 + \cdots + \omega_n p_n \tag{2-96}$$




The question is to find the a posteriori probability of the occurrence of
event $A_k$, given the occurrence of E.

$$P\{A_k|E\} = \frac{P\{A_kE\}}{P\{E\}} = \frac{P\{A_k\}P\{E|A_k\}}{\sum_{i=1}^{n} P\{E|A_i\}P\{A_i\}} \tag{2-97}$$

or, equivalently,

$$P\{A_k|E\} = \frac{\omega_k p_k}{\omega_1 p_1 + \omega_2 p_2 + \cdots + \omega_n p_n} \tag{2-98}$$

This equation comprises what is known as Bayes's theorem.


Example 2-20. Let $U_1$, $U_2$, $U_3$ be three urns with two red and one black, three red
and two black, and one red and one black balls, respectively. One of the three urns
is chosen at random and a ball is drawn from it. The color of the ball is found to be
black. What is the probability that it has been chosen from $U_3$?

Solution. This is an example of a situation where Bayes's theorem can be applied.
Let E be the event that a black ball has been drawn; $A_i$ is the event that the ith urn
has been chosen, i = 1, 2, 3.

Then

$$P\{E|A_1\} = \tfrac{1}{3} \qquad P\{E|A_2\} = \tfrac{2}{5} \qquad P\{E|A_3\} = \tfrac{1}{2}$$

Also, $P\{A_3|E\} = P\{\text{choosing urn } U_3 \,|\, \text{black ball drawn}\}$

$$P\{A_3|E\} = \frac{P\{A_3\}P\{E|A_3\}}{\sum_{i=1}^{3} P\{E|A_i\}P\{A_i\}} = \frac{\tfrac{1}{3} \cdot \tfrac{1}{2}}{\tfrac{1}{3}\left(\tfrac{1}{3} + \tfrac{2}{5} + \tfrac{1}{2}\right)} = \tfrac{15}{37}$$
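
Equation (2-98) translates directly into a few lines of code. The sketch below is an illustration with a hypothetical function name `posteriors`; it takes the probabilities $\omega_k$ and $p_k$ as lists and is applied to the three urns of Example 2-20.

```python
from fractions import Fraction


def posteriors(priors, likelihoods):
    """Return P{A_k|E} for every k, following Eq. (2-98).

    priors[k] plays the role of omega_k = P{A_k};
    likelihoods[k] plays the role of p_k = P{E|A_k}.
    """
    joint = [w * p for w, p in zip(priors, likelihoods)]  # omega_k * p_k
    p_e = sum(joint)                                      # Eq. (2-96)
    return [j / p_e for j in joint]


urn_priors = [Fraction(1, 3)] * 3
black_given_urn = [Fraction(1, 3), Fraction(2, 5), Fraction(1, 2)]
post = posteriors(urn_priors, black_given_urn)
print(post)
```

The a posteriori probabilities always sum to one, since the events $A_k$ are mutually exclusive and exhaustive.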


Example 2-21. Three urns are given : 

Urn 1 contains two white, three black, and four red balls. 

Urn 2 contains three white, two black, and two red balls. 

Urn 3 contains four white, one black, and one red ball. 

One urn is chosen at random, and two balls are drawn from that urn. If the two 
balls happen to be white and red, what is the probability that they were drawn from 
urn 3? 

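Solution. Let $A_i$ = event of choosing urn i, i = 1, 2, 3

RW = event of choosing a red and a white ball

We want $P\{A_3|RW\}$.

Using Bayes's rule,

$$P\{A_3|RW\} = \frac{P\{A_3\}P\{RW|A_3\}}{P\{A_1\}P\{RW|A_1\} + P\{A_2\}P\{RW|A_2\} + P\{A_3\}P\{RW|A_3\}}$$

But

$$P\{A_1\} = P\{A_2\} = P\{A_3\} = \tfrac{1}{3}$$

$$P\{RW|A_1\} = \frac{\binom{2}{1}\binom{4}{1}}{\binom{9}{2}} = \frac{8}{36}$$

$$P\{RW|A_2\} = \frac{\binom{3}{1}\binom{2}{1}}{\binom{7}{2}} = \frac{6}{21}$$

$$P\{RW|A_3\} = \frac{\binom{4}{1}\binom{1}{1}}{\binom{6}{2}} = \frac{4}{15}$$

Therefore,

$$P\{A_3|RW\} = \frac{\tfrac{1}{3} \cdot \tfrac{4}{15}}{\tfrac{1}{3}\left(\tfrac{8}{36} + \tfrac{6}{21} + \tfrac{4}{15}\right)} = \frac{21}{61}$$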
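
In this example the conditional probabilities $P\{RW|A_i\}$ are themselves ratios of combinations. The sketch below computes them by counting and then forms the a posteriori probability; the dictionary layout and function name are hypothetical, and it is assumed that the two balls are drawn without replacement with all pairs equally likely.

```python
from fractions import Fraction
from math import comb

urns = [
    {"white": 2, "black": 3, "red": 4},
    {"white": 3, "black": 2, "red": 2},
    {"white": 4, "black": 1, "red": 1},
]


def prob_red_and_white(urn):
    """P{RW|urn}: one red and one white among two balls drawn together."""
    total = sum(urn.values())
    favorable = urn["red"] * urn["white"]   # choose 1 red and 1 white
    return Fraction(favorable, comb(total, 2))


likelihoods = [prob_red_and_white(u) for u in urns]
prior = Fraction(1, 3)                       # each urn equally likely
p_rw = sum(prior * lk for lk in likelihoods)
posterior_urn3 = prior * likelihoods[2] / p_rw
print(likelihoods, posterior_urn3)
```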


Bayes's* theorem comprises one of the most used, and occasionally
misused, concepts of probability theory. In many problems an event 
may occur as an “effect” of several “causes.” From a number of obser- 
vations on the occurrence of the effect, one can make an estimate on the 
occurrence of the cause leading to that effect. This rule is frequently 
applied to communication problems, particularly in the detection of sig- 
nals from an additive mixture of signals and noise. When the detecting 
instrument indicates a signal, we have to make a decision whether the 
received signal is a true one or a false alarm due to undesired signals 
(noise) in the system. Such decisions are generally made possible by an 
application of Bayes’s rule which is also called the rule of inverse proba- 
bility. The decision criterion may be made more effective by introducing 
some kind of weighting coefficients called loss matrix and minimizing the 
over-all “loss.” 

2-13. Combinatorial Problems in Probability. In many problems
involving choice and probability, the number of possible ways of arranging
a given number of objects on a line is of interest. For example, if
three persons A, B, and C are standing in a line, the probability that A
remains next to B can be calculated as follows: There are six different
arrangements possible:

ABC ACB BAC BCA CAB CBA

Of these arrangements, there are four desirable ones. Thus, if the
concept of equiprobable measures is assumed, the probability in question
is 4/6 = 2/3.

Combinatorial problems have a limited use in our subsequent studies. 
For this reason, we shall give only a review of the most pertinent defini- 
tions in this section. The reader interested in combinatorial problems 
will find a considerable amount of information in Feller (Chaps. 2 to 4). 
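
The line-up example at the start of this section can be reproduced by listing the arrangements, which is the ratio of favorable to possible cases used throughout the section. This is a sketch under the equiprobable assumption; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def prob_adjacent(people, first, second):
    """Probability that `first` stands next to `second` in a random line."""
    lines = list(permutations(people))
    favorable = sum(
        1 for line in lines
        if abs(line.index(first) - line.index(second)) == 1
    )
    return Fraction(favorable, len(lines))


p_a_next_to_b = prob_adjacent("ABC", "A", "B")
print(p_a_next_to_b)
```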

Permutation: A permutation of the elements of a finite set is a one-to- 
one correspondence between elements of that set (such a correspondence 
is also called a mapping of the set onto itself). For example, if a set 


* Reverend Thomas Bayes's article An Essay towards Solving a Problem in the 
Doctrine of Chances was published in Philosophical Transactions of the Royal Society 
of London (vol. 1, no. 3, p. 370, 1763). However, Bayes's work remained rather 
unknown until 1774, when Laplace discussed it in one of his memoirs. 



contains only four objects A, B, C, and D, we may write two equivalent
sets

A, B, C, D and B, C, A, D

The ordered sets

$$\begin{pmatrix} A & B & C & D \\ 1 & 2 & 3 & 4 \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} B & C & A & D \\ 1 & 2 & 3 & 4 \end{pmatrix}$$

are permutations of the elements of the original set, since

$$A \to B \qquad B \to C \qquad C \to A \qquad D \to D \tag{2-99}$$


The following definition is of considerable assistance in dealing with 
combinatorial problems. 

Factorial: The factorial function for a positive integer n is defined as

$$n! = n(n - 1)(n - 2) \cdots 4 \cdot 3 \cdot 2 \cdot 1 \tag{2-100}$$

with the additional convention

$$0! = 1 \tag{2-101}$$

The number of different permutations of a set with n distinct elements is

$$P_n = n(n - 1)(n - 2) \cdots 4 \cdot 3 \cdot 2 \cdot 1 = n! \tag{2-102}$$

Combination: The number of different permutations of r objects
selected from n objects is

$$P_r^n = n(n - 1)(n - 2) \cdots (n - r + 1) = \frac{n!}{(n - r)!} \tag{2-103}$$

Every permutation of elements of a set contains the same elements but in
different order. When two sets of objects are in one-to-one correspondence
so that some of the elements of one do not appear in the other they
are called different combinations. For example, if we combine the members
of the set (A, B, C, D) two by two, AB, AC, DB are different combinations
but AB and BA are not.

The number of different combinations of n objects taken r at a time is

$$C_r^n = \frac{P_r^n}{r!} = \frac{n!}{r!(n - r)!} \tag{2-104}$$

When confusion will not result, one may use the notation $\binom{n}{r}$ for $C_r^n$.
Note that

[Equations (2-105) to (2-108) are illegible in the source.]


The following theorem is often used in combinatorial problems. Let
a set contain k mutually exclusive subsets of objects:

$$\{A_1, A_2, \ldots, A_k\}$$

with $n_i$, i = 1, 2, . . . , k, being the number of elements in the set $A_i$.
The number of permutations of the total number of elements n is

$$\frac{n!}{n_1! \, n_2! \cdots n_k!} \tag{2-109}$$

In fact, one has to divide the number of permutations of n objects by
$n_i!$ (for i = 1, 2, . . . , k) since the permutations of the identical objects
of the $A_i$ set cannot be distinguished from each other. For example,
the number of color permutations of three black and two white balls is

$$\frac{5!}{3! \, 2!} = \frac{5 \times 4}{2!} = 10$$
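
Formula (2-109) can be compared with a brute-force count of the distinguishable arrangements. The following sketch uses a hypothetical function name and checks the case of three black and two white balls.

```python
from itertools import permutations
from math import factorial


def distinguishable_permutations(counts):
    """n! / (n_1! n_2! ... n_k!) for subsets of identical objects."""
    result = factorial(sum(counts))
    for n_i in counts:
        result //= factorial(n_i)   # each partial quotient is an integer
    return result


by_formula = distinguishable_permutations([3, 2])
# Brute force: arrange the letters and keep only the distinct color patterns.
by_enumeration = len(set(permutations("BBBWW")))
print(by_formula, by_enumeration)
```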


Binomial Expansion: Let n be a positive integer; then

$$(a + b)^n = a^n + \binom{n}{1}a^{n-1}b + \binom{n}{2}a^{n-2}b^2 + \cdots + \binom{n}{r}a^{n-r}b^r + \cdots + b^n \tag{2-110}$$

or

$$(a + b)^n = a^n + na^{n-1}b + \frac{n(n - 1)}{2!}a^{n-2}b^2 + \frac{n(n - 1)(n - 2)}{3!}a^{n-3}b^3 + \cdots + b^n \tag{2-111}$$

A useful display of a binomial coefficient is given in a table which is
called Pascal's triangle:



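$$\begin{array}{cc}
\binom{1}{0} \; \binom{1}{1} & 1 \quad 1 \\[4pt]
\binom{2}{0} \; \binom{2}{1} \; \binom{2}{2} & 1 \quad 2 \quad 1 \\[4pt]
\binom{3}{0} \; \binom{3}{1} \; \binom{3}{2} \; \binom{3}{3} & 1 \quad 3 \quad 3 \quad 1 \\[4pt]
\binom{4}{0} \; \binom{4}{1} \; \binom{4}{2} \; \binom{4}{3} \; \binom{4}{4} & 1 \quad 4 \quad 6 \quad 4 \quad 1
\end{array} \tag{2-112}$$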
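
The rows of Pascal's triangle can be generated without computing any factorial. The sketch below relies on the standard recurrence $\binom{n}{r} = \binom{n-1}{r-1} + \binom{n-1}{r}$, which is not stated in the text above and is introduced here as an assumption of the example; the function name is hypothetical.

```python
from math import comb


def pascal_rows(max_n):
    """Rows 0..max_n of Pascal's triangle, built from the row above."""
    rows = [[1]]
    for _ in range(max_n):
        prev = rows[-1]
        inner = [prev[i] + prev[i + 1] for i in range(len(prev) - 1)]
        rows.append([1] + inner + [1])
    return rows


rows = pascal_rows(4)
for row in rows:
    print(row)
```

Each entry of row n agrees with the binomial coefficient given by Eq. (2-104).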



In the following a number of simple examples dealing with permuta- 
tions and combinations are presented. In these examples, the primary 
assumption is that the probability is given by the frequency of the event 
under consideration; that is, the concept of equiprobable measure prevails.
Hence, such problems are reduced to a study of the ratio of the favorable 
cases to all possible cases. In this respect the formula of combinatorial 
analysis will be used. 


Example 2-22. What is the probability of a person having four aces in a bridge
hand?

Solution. The number of all possible different hands equals the combination of
13 from 52 cards. For the number of favorable cases one may think of first removing
the four aces from the deck and then dealing all possible combinations of hands 9 by 9.
The addition of the four aces to each one of these latter hands gives a favorable case.

$$\binom{48}{9} \div \binom{52}{13} = \frac{10 \cdot 11 \cdot 12 \cdot 13}{49 \cdot 50 \cdot 51 \cdot 52} = \frac{11}{4{,}165}$$


Example 2-23. Two cards are drawn from a regular deck of cards. What is the 
probability that neither is a heart? 

Solution. Let A and B be the events that the first and the second card are hearts,
respectively; then we wish to know $P\{A'B'\}$.

$$P\{A'\} = 1 - P\{A\} = 1 - \tfrac{1}{4} = \tfrac{3}{4}$$

$$P\{B'|A'\} = \tfrac{38}{51}$$

Therefore

$$P\{A'B'\} = \tfrac{3}{4} \cdot \tfrac{38}{51} = \tfrac{19}{34}$$

If we wish to apply combinatorial principles, we may say that the number of all
possible cases of selecting two cards is $\binom{52}{2}$. The number of favorable cases is
$\binom{39}{2}$. Therefore the probability in question is

$$\binom{39}{2} \div \binom{52}{2} = \frac{39!}{2! \, 37!} \cdot \frac{2! \, 50!}{52!} = \frac{39 \cdot 38}{51 \cdot 52} = \frac{19}{34}$$
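
The ratios of combinations in Examples 2-22 and 2-23 are easy to evaluate exactly. The sketch below uses `math.comb` from the Python standard library; the variable names are hypothetical.

```python
from fractions import Fraction
from math import comb

# Example 2-22: four aces in a 13-card hand. Favorable hands are the four
# aces plus any 9 of the remaining 48 cards.
p_four_aces = Fraction(comb(48, 9), comb(52, 13))

# Example 2-23: neither of two cards is a heart. Favorable pairs are drawn
# from the 39 non-heart cards.
p_no_heart = Fraction(comb(39, 2), comb(52, 2))

# The same answer from the multiplication rule of Sec. 2-11.
p_no_heart_by_product = Fraction(39, 52) * Fraction(38, 51)

print(p_four_aces, p_no_heart, p_no_heart_by_product)
```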

2-14. Trees and State Diagrams. The material of this section is 
intended to offer a graphical interpretation for certain simple problems of 
probability which arise in dealing with repeated trials of an experiment. 





For example, suppose that a biased coin is tossed once ; the outcome may 
be denoted by H and T and shown by the diagram of Fig. 2-20. Simi- 
larly, if the coin is tossed twice, the second set of outcomes may be shown 
in the same treelike diagram. If the probability of getting a head is 
denoted by p, then the probability of getting, say, HT can be directly 
computed from the weighted length of the associated tree path, that is, 

p(l - p) 

If it is desired to obtain the probability of getting a head and a tail 



irrespective of their order, then the answer to the problem is given by 
summing up the two weighted tree paths. 

p(l - p) + (1 - p)p = 2p(l - p) 

This simple graphical procedure can be used profitably in certain types of 
problems. The following are examples of such problems. 

Example 2-24. The urn A contains five black and two white balls. The urn B 
contains three black and two white balls. If one urn is selected at random, what is 
the probability of drawing a white ball from that urn? 

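Solution. From the tree diagram of Fig. 2-21 one can see that the probability of
the event of interest is the sum of the following measures:

$$\tfrac{1}{2} \cdot \tfrac{2}{7} + \tfrac{1}{2} \cdot \tfrac{2}{5} = \tfrac{1}{7} + \tfrac{1}{5} = \tfrac{12}{35}$$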


Example 2-25. Find the probability that at least three heads are obtained in a
sequence of four throws of an honest coin.

Solution. From a tree diagram or from the binomial expansion one obtains

$$\binom{4}{4}\left(\tfrac{1}{2}\right)^4 + \binom{4}{3}\left(\tfrac{1}{2}\right)^3\left(\tfrac{1}{2}\right) = \tfrac{1}{16} + \tfrac{4}{16} = \tfrac{5}{16}$$

If a coin is tossed n times, we note that the probability of getting, say,
exactly r heads (r < n) is the sum of the tree measures of all tree paths
leading to r heads and n - r tails. Since there are $\binom{n}{r}$ such states, it is
found that the desired probability is

$$\binom{n}{r} p^r (1 - p)^{n-r} \tag{2-113}$$
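
Expression (2-113) can be written as a short function and summed over r to answer "at least" questions such as the one in the preceding example. This is a minimal sketch with hypothetical function names; p is the probability of a head on a single toss.

```python
from fractions import Fraction
from math import comb


def prob_exactly_r_heads(n, r, p):
    """Eq. (2-113): number of tree paths times the measure of each path."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def prob_at_least_r_heads(n, r, p):
    """Sum of Eq. (2-113) over r, r + 1, ..., n heads."""
    return sum(prob_exactly_r_heads(n, k, p) for k in range(r, n + 1))


half = Fraction(1, 2)
p_three_or_more = prob_at_least_r_heads(4, 3, half)
print(p_three_or_more)
```

Summing over all r from 0 to n gives 1 for any p, which is the binomial expansion (2-110) applied to $(p + (1 - p))^n$.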

The tree diagram can easily be drawn for experiments with a finite 
number of outcomes. In the problems discussed thus far in this section, 
it is tacitly assumed that the outcomes of each experiment remain inde- 
pendent of the previous experiments. In engineering terminology such 
experiments are said to lack memory. For these experiments the 
probability of any outcome is always the same. That is, an outcome of
the nth trial has exactly the same probability of occurrence as in the kth
trial (k ≠ n). This type of experiment leads to the concept of so-called
independent stochastic processes. In certain types of problems an outcome
may be influenced by the past history or “memory” of the experiment.
Such experiments are termed dependent stochastic processes.
in which the probability of an outcome of a trial depends on the outcome 
of the immediately preceding trial. These are called Markov processes. 

Let an experiment have a finite number of n possible outcomes, $a_1$,
$a_2, \ldots,$ and $a_n$, called states. We assume the process to be of the finite
Markov type and initially in the state k. For a Markov process, we
specify a table of probabilities associated with transitions from any state
to any other state. This is called a probability transition matrix.

$$\begin{array}{c|ccccc}
 & a_1 & a_2 & a_3 & \cdots & a_n \\ \hline
a_1 & p_{11} & p_{12} & p_{13} & \cdots & p_{1n} \\
a_2 & p_{21} & p_{22} & p_{23} & \cdots & p_{2n} \\
a_3 & p_{31} & p_{32} & p_{33} & \cdots & p_{3n} \\
\vdots & \vdots & \vdots & \vdots & & \vdots \\
a_n & p_{n1} & p_{n2} & p_{n3} & \cdots & p_{nn}
\end{array} \tag{2-114}$$

$p_{jk} = P\{a_k|a_j\}$ denotes the probability that the next outcome of the
experiment will be the state k, given that the immediately preceding experiment
led to the state j. Note that in a transition probability matrix the sum
of all elements of each row must equal unity.

One of the most common problems associated with the Markov process
is, given that it started with state j, to find the probability of reaching the
state k after a specified number of steps r, that is, $P\{a_k|a_j\}^{(r)}$.

This question has a rather simple answer, namely, (1) draw the tree 
diagram, (2) select all tree paths connecting the node representing the 
state j to that of the state fc in r steps, and (3) add the corresponding tree 
measures. This procedure is exemplified in the tree diagram of Fig.
2-22 for r = 1 and r = 2. When r = 1, the answer is obvious:

$$P\{a_k|a_j\}^{(1)} = P\{a_j\}P\{a_k|a_j\} \tag{2-115}$$

For r = 2 one has to add the probabilities of reaching state $a_k$ from the
state $a_j$ in all possible ways, that is, the sum of the measures associated
with all tree paths connecting $a_j$ to $a_k$ in two steps.

Fig. 2-22. Trees for a finite chain. (a) r = 1. (b) r = 2. (c) r = 1. (d) r = 2.

$$\begin{aligned}
P\{a_k|a_j\}^{(2)} &= P\{a_j\}\left[P\{a_1|a_j\}P\{a_k|a_1\} + P\{a_2|a_j\}P\{a_k|a_2\} + \cdots + P\{a_n|a_j\}P\{a_k|a_n\}\right] \\
&= P\{a_j\}\sum_{i=1}^{n} P\{a_i|a_j\}P\{a_k|a_i\} = P\{a_j\}\sum_{i=1}^{n} p_{ji}p_{ik}
\end{aligned}$$

For r = 3,

$$P\{a_k|a_j\}^{(3)} = P\{a_j\}\sum_{l=1}^{n}\sum_{i=1}^{n} p_{ji}p_{il}p_{lk}$$

By defining the initial probability of different states as a diagonal matrix,

$$[P_D^{(0)}] = \begin{bmatrix}
P\{a_1\} & 0 & \cdots & 0 \\
0 & P\{a_2\} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & P\{a_n\}
\end{bmatrix}$$

we can sum up the above development in concise matrix notation. That
is, for any states j and k we have

$$[P\{a_k|a_j\}^{(1)}] = [P_D^{(0)}][P]$$

Similarly,

$$[P\{a_k|a_j\}^{(2)}] = [P_D^{(0)}][P][P] = [P_D^{(0)}][P]^2$$

For the general case,

$$[P\{a_k|a_j\}^{(r)}] = [P_D^{(0)}][P]^r \tag{2-116}$$

This relation determines the probability $P\{a_k|a_j\}^{(r)}$ for any values of j,
k, and r.

Consider next the probability of reaching the state $a_k$ in r steps, given
that the initial state could have been any $a_i$, i = 1, 2, . . . , n, that is,
the probability of getting to $a_k$ (in r steps) when any of the n states could
have been the initial state. Let this probability be symbolized by
$P\{a_k|\ \}^{(r)}$. Figure 2-22c and d illustrates the case for r = 1 and r = 2,
respectively.

For r = 1,

$$P\{a_k|\ \}^{(1)} = \sum_{i=1}^{n} P\{a_i\}P\{a_k|a_i\}$$

For r = 2,

$$P\{a_k|\ \}^{(2)} = \sum_{i=1}^{n}\sum_{j=1}^{n} P\{a_i\}P\{a_j|a_i\}P\{a_k|a_j\}$$

For r = 3,

$$P\{a_k|\ \}^{(3)} = \sum_{l=1}^{n}\sum_{i=1}^{n}\sum_{j=1}^{n} P\{a_i\}P\{a_j|a_i\}P\{a_l|a_j\}P\{a_k|a_l\}$$

The matrix formulation follows immediately. Let $[P^{(0)}]$ be a row matrix
describing the initial probabilities $[P\{a_1\}, P\{a_2\}, \ldots, P\{a_n\}]$; then

$$[P\{a_k|\ \}^{(1)}] = [P^{(0)}][P]$$

$$[P\{a_k|\ \}^{(2)}] = [P^{(0)}][P]^2$$

For the general case,

$$[P\{a_k|\ \}^{(r)}] = [P^{(0)}][P]^r \tag{2-117}$$

This relation determines the probability $P\{a_k|\ \}^{(r)}$ for any values of
positive integers k and r. Note that $[P\{a_k|\ \}^{(r)}]$ will always be a row
matrix since $[P^{(0)}]$ is a row matrix.
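
Relation (2-117) can be compared numerically with the tree-path summation for r = 2. The sketch below uses a hypothetical two-state transition matrix and hypothetical initial probabilities (they are not taken from any example in the text), plain Python lists as matrices, and exact fractions.

```python
from fractions import Fraction


def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def mat_pow(m, r):
    """[P]^r by repeated multiplication, starting from the identity."""
    n = len(m)
    result = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for _ in range(r):
        result = mat_mul(result, m)
    return result


def state_probabilities(p0, P, r):
    """Row matrix [P^(0)][P]^r of Eq. (2-117)."""
    return mat_mul(p0, mat_pow(P, r))[0]


def tree_sum_two_steps(p0, P, k):
    """Sum of the tree measures of all two-step paths ending in state k."""
    n = len(P)
    return sum(p0[0][i] * P[i][j] * P[j][k] for i in range(n) for j in range(n))


# Hypothetical chain: each row of the transition matrix sums to unity.
P = [[Fraction(1, 2), Fraction(1, 2)],
     [Fraction(1, 4), Fraction(3, 4)]]
p0 = [[Fraction(1, 2), Fraction(1, 2)]]   # row matrix of initial probabilities

after_two = state_probabilities(p0, P, 2)
print(after_two)
```

The matrix product and the explicit double sum give the same row of probabilities, which is the content of the step from the tree diagram to the matrix notation.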

Example 2-26. A relay alternates between the open state denoted by 1 and the 
closed state designated by 0. The transition probability matrix is given as 


1 

0 


1 0 


[H H 
IH H 


] 




Assuming that the initial probability of the relay being in either state is 1/2, determine

(a) The probability of reaching state 1 via state 0 in one step, that is, $p_{01}^{(1)}$.

(b) $p_{00}^{(1)}$.

(c) $p_{01}^{(2)}$.

(d) $p_{11}^{(2)}$.

(e) $p\{1|\ \}^{(2)}$, the probability of reaching state 1 in two steps.




Solution. The state diagram and the tree diagram are drawn in Fig. E2-26.
According to the tree diagram, 

(а) Po.<« = 

(б) Po.'‘> = H • H 

(c) poi<*' = mi ■% + % yi) = % 

(d) pn'« = yi(H ■%+H-H) =Hh 

(e) p{l| )<« - P.1<« + P01<»' = Ks + ^ 

An alternative solution for part (e) is given by the matrix relation of Eq, (2-117). 



Therefore, p|l| |(*) = p|0| 

Finally we may answer the same question by using the materials of Secs. 2-9 
(Theorem of Addition) and 2-10 (Conditional Probability). 

(a) pl01\ = p{0]p{l\0\ = H ■ H = H 

(h) pm\ = p{o\pm\ = yi ■ % = H 

(c) plOOl) -1- plOll) = pIOOIpUIO) -t- pI01)p11|11 

= yi-yi + yiH = ?i 

(d) pllll) -1- p|101| = p1111p|1|1| + pll0|p|l|0| 

~yi-}i-H +yi-Hyi = ^8 

(«) pUI = 

Example 2-27. A communication source having a three-letter alphabet transmits 
sequences of messages. The transition probability matrix is given below: 


ABC 






For the beginning of each message, letters A, B, C occur with probabilities Hst Hsf 
and Ksi respectively. 

(a) Determine the probability of getting a message commencing with 

AB, BB, CA, 

ABA, BBC, CAC 

(b) Find a set of initial probabilities which will produce a so-called “steady state,”
i.e., the probability that the letter transmitted at the nth state does not depend on n.

Solution 

(а) The probabilities in question are, respectively, 

%4 Ms ■ H = %4 14 % 3^62 

^4 ■ H - ^62 ?^4 ■ I'i = ^62 ®?i62 * % = ^5-243 

(b) The desired initial probability matrix must satisfy the condition

$$[P^{(0)}][P]^{n} = [P^{(0)}][P]^{n-1} \qquad n \text{ a positive integer}$$

In particular,

$$[P^{(0)}][P] = [P^{(0)}]$$

Therefore a = 

^ = otli + 

7 = ot% + + yH 

These equations lead to 

a = H /3 = K y = 

It can be shown that, if one considers very long messages, the frequency of the 
occurrence of the letter A will approach etc. For further comments on the 
Markov chain, see Chap. 11. 
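
The steady-state condition $[P^{(0)}][P] = [P^{(0)}]$ can also be explored numerically. The sketch below does not solve the linear equations of the example; it swaps in repeated multiplication by $[P]$ until the row matrix stops changing. The three-letter transition matrix `P` is a hypothetical one chosen for illustration (the matrix of Example 2-27 is not reproduced here), and the names `step` and `steady_state` are assumptions of this sketch.

```python
def step(row, P):
    """One transition: return the row matrix [row][P]."""
    n = len(P)
    return [sum(row[i] * P[i][k] for i in range(n)) for k in range(n)]

def steady_state(P, tol=1e-12, max_iter=10_000):
    """Iterate [row][P] from a uniform start until it stops changing."""
    row = [1.0 / len(P)] * len(P)
    for _ in range(max_iter):
        nxt = step(row, P)
        if max(abs(a - b) for a, b in zip(nxt, row)) < tol:
            return nxt
        row = nxt
    raise RuntimeError("no convergence; the chain may be periodic")

# Hypothetical transition matrix for letters A, B, C (each row sums to 1).
P = [[0.5, 0.25, 0.25],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]

alpha, beta, gamma = steady_state(P)
print(alpha, beta, gamma)
```

For a matrix with all entries positive, as assumed here, the iteration settles on a row matrix that is left unchanged by a further multiplication by $[P]$, which is exactly the condition stated in part (b).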

2-15. Random Variables. In the preceding sections the concept of
an event and of sample space of an experiment played an important role. 
The discussion of the present section is aimed at an intuitive introduction 
of random variables. 

Most experiments of practical interest have numerical outcomes; 
that is, the result of the experiment is a number, or a pair of numbers, 
etc. In other words, the results can be described by using a coordinate 
space, the coordinate space being in a correspondence with the sample 
space of the event. 

A random variable is a real-valued function defined over the sample 
space of a random experiment. Restricting the random variable to 
assume only real values is quite natural, as one is interested in the 
numerical outcomes of an experiment (even though in various practical 
applications complex values of random variables are also considered). 
The word “random” stresses the fundamental fact that we are dealing
with experiments governed by laws of chance rather than any deter- 
ministic law. The throws of a symmetrical die or coin under hypo- 
thetically symmetrical conditions represent random experiments. The 
salient feature of these experiments is that, even though they exhibit a 
certain kind of regularity when repeated over a long range of time, it is 




impossible to predict, with complete certainty, the outcome of any 
particular trial. 

Let $\Omega$ be the sample space of a random experiment. Each point of $\Omega$
describes a possible outcome of the experiment. This outcome may
not be a numerical result in itself but some numerical data can be assigned
to it. For instance, if the experiment were the picking at “random” of a
card out of a deck of 52, the number of possible outcomes at any particu- 
lar trial would be 52, depending upon which one of the cards had been 
picked. Here, although the outcome does not furnish us with a numerical 
result, we can represent the possible outcomes by, say, the first 52 
integers or by 52 points on a line. 

The correspondence between a point of $\Omega$ and a point in the coordinate
space is designated by a mathematical function. This function is termed
a random variable. Generally, we shall denote random variables by
capital letters such as X and Y, and their specific values by the same
letters in lower case. A random variable X assumes different values
$x_1, x_2, \ldots, x_n, \ldots$ which are points of the coordinate space. The
coordinate space may be a one-dimensional or a multidimensional space. 
The random variable may take a finite number of n-tuple values or 
infinitely many. The sample space may be a space with finite or 
countably infinite points or even a continuous space, that is, with an 
uncountable number of points. The following practical examples illus- 
trate some possibilities. 

Example 2-28. The experiment is throwing an ordinary honest die. The sample 
space has six events of interest. The associated random variable takes only six possi- 
ble numerical values, 1, 2, 3, 4, 5, and 6. Each of these real numbers corresponds to a 
specific event. 

Example 2-29. The experiment is throwing three honest dice. The associated
random variable takes on $6^3 = 216$ different triads of numbers as values. The random
variable may be conveniently represented by a point in the three-dimensional euclid- 
ean space. 
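
The idea of a random variable as a function defined over the sample space can be made concrete with a short sketch of Example 2-29. The function `X` below is the triad-valued variable of the example; the second function `S` (the sum of the three faces) is a hypothetical extra variable, added only to show that several random variables can be defined on the same sample space.

```python
from itertools import product

faces = range(1, 7)
# Sample space of three honest dice: every ordered triad of faces.
sample_space = list(product(faces, repeat=3))

def X(outcome):
    """Triad-valued random variable: a point in three-dimensional space."""
    return outcome

def S(outcome):
    """Hypothetical second random variable: the sum of the three faces."""
    return sum(outcome)

values_X = {X(w) for w in sample_space}
values_S = sorted({S(w) for w in sample_space})

print(len(sample_space), len(values_X))
print(values_S)
```

Both functions assign a numerical value to each point of the sample space; `X` takes 216 distinct values while `S` takes only the integers 3 through 18.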

2-16. Discrete Probability Functions and Distributions. Consider
$\Omega$, the sample space of a random experiment. If the outcomes of this
experiment can be put into one-to-one correspondence with the positive
integers, the sample space will contain a countable number of points.
Such a sample space is said to be a discrete sample space. In a discrete
sample space, when the random variable X assumes values

$$[x_1, x_2, \ldots, x_k, \ldots]$$

the probability function f(x) is defined as

$$[p_1, p_2, \ldots, p_k, \ldots]$$

where

$$f(x_k) = P\{X = x_k\} = p_k \tag{2-118}$$





The probability distribution function F(x), known also as the cumulative
distribution function (CDF), is defined as

$$F(x) = P\{X \le x\} = \sum_{x_k \le x} f(x_k) \tag{2-119}$$

For example, the throw of an honest coin until a head appears is a random
experiment. The sample space of this experiment is a discrete space. If
X corresponds to the event of the appearance of the first head on the
kth throw, then X assumes the following values:

$$[X] = [1, 2, 3, \ldots, k, \ldots] \tag{2-120}$$

The probability function f(x) and the CDF are

$$f(x) = [2^{-1}, 2^{-2}, 2^{-3}, \ldots, 2^{-k}, \ldots] \tag{2-121}$$

$$F(x) = 2^{-1} + 2^{-2} + \cdots + 2^{-k} \qquad k \le x < k + 1$$

These functions are plotted in Fig. 2-23a and b, respectively.
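
A minimal sketch of the probability function and CDF of Eqs. (2-120) and (2-121) follows. The names `pmf` and `cdf` are assumptions of this sketch; exact fractions are used so that the partial sums can be compared with the closed form $1 - 2^{-k}$.

```python
from fractions import Fraction
from math import floor

def pmf(k):
    """f(k) = 2^-k: probability that the first head appears on throw k."""
    return Fraction(1, 2 ** k) if k >= 1 else Fraction(0)

def cdf(x):
    """F(x): sum of f(k) over all integers k with 1 <= k <= x."""
    return sum((pmf(k) for k in range(1, floor(x) + 1)), Fraction(0))

print([pmf(k) for k in range(1, 5)])
print(cdf(3), cdf(2.5), cdf(0.5))
```

The CDF is a step function: it is constant between consecutive integers and jumps by f(k) at each integer k.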

Fig. 2-23. (a) Probability function associated with Eq. (2-121). (b) CDF associated with Eq. (2-121).

The definition of the probability function and CDF can be directly
extended to the case of a multivariate random variable. For instance,
in Example 2-29 the sample space is a three-dimensional euclidean space
with 216 points. The random variable X assumes 216 triad values
$X = (X_1, X_2, X_3)$ for any experiment. The corresponding probability
function is

$$f(x_1,x_2,x_3) = P\{X_1 = x_1,\ X_2 = x_2,\ X_3 = x_3\}$$

$$f(x_1,x_2,x_3) = \tfrac16 \cdot \tfrac16 \cdot \tfrac16 = \tfrac{1}{216}$$

Here all permissible outcomes have equal probabilities.

The CDF gives the total probability of the set of points having each
coordinate less than or equal to some specified value $(x_1,x_2,x_3)$, that is,

$$F(x_1,x_2,x_3) = \sum f(x_i,x_j,x_k)$$

for

$$x_i \le [x_1] \qquad x_j \le [x_2] \qquad x_k \le [x_3]$$

where [ ] denotes the greatest integer contained in the letter inside
the brackets.





2-17. Bivariate Discrete Distribution. The case of a random variable
assuming pairs of values $(x_j, y_k)$ is of particular interest. In fact, in
most engineering problems the interrelation between two random quantities
leads to a bivariate discrete distribution. The joint probability
function and the CDF are defined as before:

$$f(x,y) = P\{X = x,\ Y = y\} \tag{2-122}$$

$$F(x,y) = P\{X \le x,\ Y \le y\}$$

If the joint probability function f(x,y) is known, say in the form of a
matrix, then there are four additional quantities of interest which can be
readily computed. These are marginal probability functions and marginal
CDF's as defined below:

$$f_1(x_i) = P\{X = x_i,\ \text{all permissible } Y\text{'s}\} = \sum_{y} f(x_i,y)$$

$$f_2(y_j) = P\{Y = y_j,\ \text{all permissible } X\text{'s}\} = \sum_{x} f(x,y_j)$$

$$F_1(x_i) = \sum_{x_k \le x_i} f_1(x_k) \tag{2-123}$$

$$F_2(y_j) = \sum_{y_k \le y_j} f_2(y_k)$$

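The indices 1 and 2 in the marginal distributions are simply to indicate
that $f_1(x)$ refers to the variable x, that is, the first variable, and $f_2(y)$
to the second variable. Now assume that all pairs of values $(x_i, y_j)$ are
written in a matrix form:

$$[X,Y] = \begin{bmatrix} x_1y_1 & x_1y_2 & \cdots & x_1y_n \\ x_2y_1 & x_2y_2 & \cdots & x_2y_n \\ \vdots & \vdots & & \vdots \\ x_my_1 & x_my_2 & \cdots & x_my_n \end{bmatrix} \tag{2-124}$$

The corresponding probabilities can be written in a similar form:

$$[P\{X,Y\}] = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & & \vdots \\ p_{m1} & p_{m2} & \cdots & p_{mn} \end{bmatrix} \tag{2-125}$$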


The marginal probability $f_1(x_2)$ is the probability of the occurrence of




events for which $X = x_2$ without regard to the value of Y. This is
readily obtained by adding the terms appearing in the second row of the
probability matrix.

$$f_1(x_2) = p_{21} + p_{22} + \cdots + p_{2n} \tag{2-126}$$

Similarly, the marginal distribution $f_2(y_k)$ can be obtained by adding the
terms of the kth column of the joint probability matrix. For example,

$$f_2(y_2) = p_{12} + p_{22} + \cdots + p_{m2} \tag{2-127}$$

If the random variables X and Y are such that for all values of $(x_i, y_j)$
we have

$$f(x_i,y_j) = f_1(x_i)\,f_2(y_j) \tag{2-128}$$


then the variables are said to be statistically independent of each other.
For example, the simultaneous throw of two honest coins has the following
outcomes:

              Y = H_2    Y = T_2
    X = H_1     1/4        1/4
    X = T_1     1/4        1/4

Evidently, these two variables are independent of each other, since for
any entry of the probability matrix we have, for example,

$$P\{X = H_1\} = p_{11} + p_{12} = \tfrac12$$

$$P\{Y = T_2\} = p_{12} + p_{22} = \tfrac12$$

$$P\{X = H_1,\ Y = T_2\} = (p_{11} + p_{12})(p_{12} + p_{22}) = \tfrac14$$

Conversely, a check for independence is to determine if Eq. (2-128) holds
for all possible outcomes.
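
The row and column sums of Eqs. (2-126) and (2-127) and the independence check of Eq. (2-128) can be sketched as follows. The function names `marginals` and `is_independent` are assumptions of this sketch, and the matrix `dependent` is a hypothetical counterexample, not taken from the text.

```python
from fractions import Fraction

def marginals(joint):
    """Return (f1, f2): row sums and column sums of a joint matrix."""
    f1 = [sum(row) for row in joint]
    f2 = [sum(col) for col in zip(*joint)]
    return f1, f2

def is_independent(joint):
    """Check f(x_i, y_j) == f1(x_i) * f2(y_j) for every entry."""
    f1, f2 = marginals(joint)
    return all(joint[i][j] == f1[i] * f2[j]
               for i in range(len(f1)) for j in range(len(f2)))

q = Fraction(1, 4)
two_coins = [[q, q],
             [q, q]]
# Hypothetical joint matrix in which Y always equals X.
dependent = [[Fraction(1, 2), Fraction(0)],
             [Fraction(0), Fraction(1, 2)]]

print(marginals(two_coins), is_independent(two_coins))
print(is_independent(dependent))
```

Exact fractions are used so that the equality test is not disturbed by rounding; with floating-point entries a small tolerance would be needed instead.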

The conditional distributions can also be defined and obtained in a
straightforward manner. The conditional probability $P\{X = x_i|Y = y_j\}$
is designated as $f(x_i|y_j)$. That is, if the computation of $f(x_i|y_j)$ is desired,
then we concentrate on the jth column of the (x,y) matrix.

$$[X|Y = y_j] = \begin{bmatrix} x_1y_j \\ x_2y_j \\ \vdots \\ x_my_j \end{bmatrix} \tag{2-129}$$

Next the term $x_iy_j$ is selected and its associated probability is obtained.

$$P\{X = x_i|Y = y_j\} = f(x_i|y_j) = \frac{f(x_i,y_j)}{f_2(y_j)} \tag{2-130}$$




It is to be noted that $f(x_i|y_j)$ is a permissible conditional probability
function as all its terms are nonnegative and

$$\sum_{i=1}^{m} f(x_i|y_j) = \frac{f_2(y_j)}{f_2(y_j)} = 1 \qquad f_2(y_j) \ne 0 \tag{2-131}$$

Similarly, the conditional probability of Y, given $X = x_i$, is found to be

$$f(y_j|x_i) = \frac{f(x_i,y_j)}{f_1(x_i)} \qquad f_1(x_i) \ne 0 \tag{2-132}$$
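
The column and row normalizations behind the conditional probability functions can be sketched as below. The joint matrix `joint` is hypothetical, and the function names are assumptions of this sketch; rows are taken to index the values of X and columns the values of Y, as in the probability matrix above.

```python
from fractions import Fraction as Fr

def conditional_x_given_y(joint, j):
    """Return [f(x_i | y_j)] for all i: column j divided by f2(y_j)."""
    f2 = sum(row[j] for row in joint)
    if f2 == 0:
        raise ValueError("f2(y_j) must be nonzero")
    return [row[j] / f2 for row in joint]

def conditional_y_given_x(joint, i):
    """Return [f(y_j | x_i)] for all j: row i divided by f1(x_i)."""
    f1 = sum(joint[i])
    if f1 == 0:
        raise ValueError("f1(x_i) must be nonzero")
    return [p / f1 for p in joint[i]]

# Hypothetical 2 x 2 joint probability matrix (entries sum to 1).
joint = [[Fr(1, 8), Fr(3, 8)],
         [Fr(2, 8), Fr(2, 8)]]

print(conditional_x_given_y(joint, 0))
print(conditional_y_given_x(joint, 0))
```

Each returned list sums to 1, which is the property that makes it a permissible conditional probability function.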

Example 2-30. Consider the simultaneous throw of two honest dice X and Y.
Find $P\{3 \le X \le 5,\ 2 \le Y \le 3\}$ and the marginal probabilities.

Solution. The two-dimensional random variable assumes 36 pairs of values, each
with an equal probability of 1/36.

$$f(x,y) = \tfrac{1}{36} \quad \text{for each point of the sample space}$$

$$F(x,y) = P\{X \le x,\ Y \le y\}$$
The marginal CDF's are

$$F_1(x) = \sum_{j=1}^{[x]} \sum_{k=1}^{6} p(j,k) = \frac{[x]}{6}$$

$$F_2(y) = \sum_{j=1}^{6} \sum_{k=1}^{[y]} p(j,k) = \frac{[y]}{6}$$


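    x \ y      1      2      3      4      5      6     f_1(x)
      1      1/36   1/36   1/36   1/36   1/36   1/36     1/6
      2      1/36   1/36   1/36   1/36   1/36   1/36     1/6
      3      1/36   1/36   1/36   1/36   1/36   1/36     1/6
      4      1/36   1/36   1/36   1/36   1/36   1/36     1/6
      5      1/36   1/36   1/36   1/36   1/36   1/36     1/6
      6      1/36   1/36   1/36   1/36   1/36   1/36     1/6
    f_2(y)    1/6    1/6    1/6    1/6    1/6    1/6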



The probability of having $3 \le X \le 5$ and $2 \le Y \le 3$ is 6/36 = 1/6. The marginal
probabilities are $P\{3 \le X \le 5\} = \tfrac12$ and $P\{2 \le Y \le 3\} = \tfrac13$. Note that the
two variables are independent, since, for all entries of the probability matrix,

$$\tfrac{1}{36} = \tfrac16 \cdot \tfrac16$$
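
The figures of Example 2-30 can be checked by direct enumeration of the 36 equally likely points. This is only a sketch; the helper `prob` is a hypothetical name, and the inequalities are read as "less than or equal to", which is the reading consistent with the sample space of the example.

```python
from fractions import Fraction
from itertools import product

p = Fraction(1, 36)
# Joint probability function of two honest dice: f(x, y) = 1/36.
joint = {(x, y): p for x, y in product(range(1, 7), repeat=2)}

def prob(event):
    """Add f(x, y) over all sample points satisfying the event."""
    return sum(pr for (x, y), pr in joint.items() if event(x, y))

p_joint = prob(lambda x, y: 3 <= x <= 5 and 2 <= y <= 3)
p_x = prob(lambda x, y: 3 <= x <= 5)
p_y = prob(lambda x, y: 2 <= y <= 3)

print(p_joint, p_x, p_y, p_joint == p_x * p_y)
```

The joint probability equals the product of the two marginal probabilities, as expected for independent variables.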

2-18. Binomial Distribution. Consider a random experiment with
only two possible outcomes, $E_1$ and $E_2$. Let the probability of the
occurrence of $E_1$ and $E_2$ be p and q = 1 - p, respectively. If the
experiment is repeated n times and the successive trials are independent






of each other, the probability of obtaining $E_1$ and $E_2$, r and n - r times,
respectively, is

$$\binom{n}{r} p^r q^{n-r} \tag{2-133}$$

This can be proved as follows: The probability of any sequence having
r events $E_1$ and n - r events $E_2$ is $p^r q^{n-r}$, as the successive trials are
assumed to be independent of each other. Moreover, the number of
such sequences is equal to the number of combinations of n objects r
at a time. Hence the formula of Eq. (2-133) follows by the addition
rule of probabilities.


    r    \binom{n}{r}    p^r      q^{n-r}    f(r)
    0         1         1.000     0.216     0.216
    1         3         0.400     0.360     0.432
    2         3         0.160     0.600     0.288
    3         1         0.064     1.000     0.064

Fig. 2-24. Example of a binomial probability function.


Let us now define a random variable X which takes the value r if in a
sequence of n trials there are exactly r events $E_1$. Then by Eq. (2-133)

$$f(r) = P\{X = r\} = \binom{n}{r} p^r q^{n-r} \tag{2-134}$$

$$F(x) = P\{X \le x\} = \sum_{r=0}^{[x]} \binom{n}{r} p^r q^{n-r}$$

The distribution function of the random variable X is a step function of
the type shown in Fig. 2-23b. The corresponding probability density
function is shown in Fig. 2-24.
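
A minimal sketch of Eq. (2-134) is given below. The values n = 3 and p = 0.4 are inferred from the numbers tabulated with Fig. 2-24 and should be read as an assumption; the function names `binom_pmf` and `binom_cdf` are likewise not part of the text.

```python
from math import comb, floor

def binom_pmf(r, n, p):
    """f(r) = C(n, r) p^r q^(n-r) with q = 1 - p."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

def binom_cdf(x, n, p):
    """F(x) = sum of f(r) for r = 0, 1, ..., [x] (capped at n)."""
    return sum(binom_pmf(r, n, p) for r in range(0, min(floor(x), n) + 1))

n, p = 3, 0.4   # assumed parameters behind the table of Fig. 2-24
table = [round(binom_pmf(r, n, p), 3) for r in range(n + 1)]
print(table)
print(binom_cdf(1.5, n, p), binom_cdf(3, n, p))
```

The same two functions apply to any n and p; for instance `binom_pmf(3, 5, 1/6)` gives the probability of exactly three 1's in five throws of a die.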

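Example 2-31. What is the probability of getting exactly three 1's in five throws
of a die? What is the probability of obtaining at most two 1's?

Solution. According to Eq. (2-134), for p = 1/6, n = 5, and r = 3, one
writes

$$P\{X = 3\} = \binom{5}{3}\left(\frac{1}{6}\right)^3\left(\frac{5}{6}\right)^2 = \frac{250}{7{,}776}$$

For the second part of the problem,

$$P\{X \le 2\} = \binom{5}{0}\left(\frac{1}{6}\right)^0\left(\frac{5}{6}\right)^5 + \binom{5}{1}\left(\frac{1}{6}\right)\left(\frac{5}{6}\right)^4 + \binom{5}{2}\left(\frac{1}{6}\right)^2\left(\frac{5}{6}\right)^3 \approx 0.96$$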





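Example 2-32. In a game of n throws of a die, for what value of n is the probability
of getting at least two 6's larger than 1/2?

$$P\{2, 3, \ldots, n \text{ 6's}\} > \tfrac12$$

Solution

$$1 - \left(\frac{5}{6}\right)^n - n\left(\frac{1}{6}\right)\left(\frac{5}{6}\right)^{n-1} > \frac12$$

The numerical answer to this inequality is found to be

$$n \ge 10$$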
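
The inequality of this example can be explored by a direct search. The sketch below takes the threshold to be 1/2, which is an assumption of the sketch, and the function names are hypothetical. The probability of at least two 6's is obtained as one minus the probabilities of zero and of exactly one 6.

```python
from fractions import Fraction

def p_at_least_two(n, p=Fraction(1, 6)):
    """P{at least two successes in n trials} = 1 - q^n - n p q^(n-1)."""
    q = 1 - p
    return 1 - q ** n - n * p * q ** (n - 1)

def smallest_n(threshold):
    """Smallest n for which P{at least two 6's} exceeds the threshold."""
    n = 2
    while p_at_least_two(n) <= threshold:
        n += 1
    return n

print(smallest_n(Fraction(1, 2)))
print(float(p_at_least_two(9)), float(p_at_least_two(10)))
```

With this threshold the search stops at n = 10: nine throws fall just short of 1/2 and ten throws just exceed it.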

2-19. Poisson's Distribution. A random variable X is said to have a
Poisson probability distribution if

$$P\{X = x\} = e^{-\lambda}\frac{\lambda^x}{x!} \tag{2-135}$$

where $\lambda > 0$, x = 0, 1, 2, . . . , and 0! = 1.

The corresponding cumulative distribution function (CDF) is

$$F(x) = \sum_{k=0}^{[x]} e^{-\lambda}\frac{\lambda^k}{k!} \qquad x \ge 0$$

$$F(x) = 0 \qquad x < 0 \tag{2-136}$$


It is to be noted that F(x) satisfies the conditions required for a distribution
function. In fact, F(x) is monotonic, increasing, and, moreover,

$$F(\infty) = \sum_{k=0}^{\infty} e^{-\lambda}\frac{\lambda^k}{k!} = e^{-\lambda}e^{\lambda} = 1$$

It is of interest to note that the Poisson distribution is a certain type of
limiting case of the binomial distribution, in which p is a specified function
of n, namely $p_n$, where

$$\lim_{n\to\infty} n p_n = \lambda > 0$$

Then

$$\lim_{n\to\infty} \binom{n}{x} p_n^x (1 - p_n)^{n-x} = e^{-\lambda}\frac{\lambda^x}{x!} \tag{2-137}$$

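The validity of Eq. (2-137) can be checked through the following algebraic
manipulations:

$$f(x) = \frac{n(n-1)\cdots(n-x+1)}{x!}\, p_n^x (1 - p_n)^{n-x} \tag{2-138}$$

$$f(x) = \frac{(np_n)^x}{x!}\cdot\frac{(1 - 1/n)(1 - 2/n)\cdots[1 - (x-1)/n]}{(1 - p_n)^x}\,(1 - p_n)^n \tag{2-139}$$

But

$$\lim_{n\to\infty} \frac{(1 - 1/n)(1 - 2/n)\cdots[1 - (x-1)/n]}{(1 - p_n)^x} = 1 \tag{2-140}$$

$$\lim_{p_n\to 0} (1 - p_n)^{-1/p_n} = e \tag{2-141}$$

Therefore,

$$\lim_{n\to\infty} (1 - p_n)^n = \lim_{n\to\infty} \left[(1 - p_n)^{-1/p_n}\right]^{-np_n} = e^{-\lambda} \tag{2-142}$$

Finally, for the limiting case we find

$$f(x) = e^{-\lambda}\frac{\lambda^x}{x!} \tag{2-143}$$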


Thus, in the binomial case, if the number of trials n becomes reasonably
large and the probability of individual success p is relatively small, so
that their product $np = \lambda$ is of moderate magnitude, the probability of
the number of successes in n trials approaches the Poisson distribution.
The following relative magnitudes illustrate a common range of application
for Poisson's distribution:

$$n > 50 \qquad p < 0.1 \qquad \lambda < 10$$

In Chap. 6 it will be shown that $\lambda$ is the “average” value for a random
variable with a Poisson distribution.
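
How closely the Poisson distribution follows the binomial in the range just quoted can be examined numerically. In the sketch below the values n = 300 and p = 0.03 are illustrative choices inside that range, and the function names are assumptions of the sketch.

```python
from math import comb, exp, factorial

def binom_pmf(x, n, p):
    """Binomial probability of exactly x successes in n trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """Poisson probability e^-lam * lam^x / x!."""
    return exp(-lam) * lam ** x / factorial(x)

n, p = 300, 0.03          # illustrative values: n > 50, p < 0.1
lam = n * p               # lam = 9 < 10

# Largest absolute difference between the two probability functions.
max_gap = max(abs(binom_pmf(x, n, p) - poisson_pmf(x, lam))
              for x in range(0, 31))
print(lam, max_gap)
```

The largest gap is small compared with the probabilities themselves near the mean, which is the sense in which the binomial "approaches" the Poisson distribution here.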


Example 2-33. Assuming that, on an average, 3 per cent of the output of a factory
making certain parts is defective and that 300 units are in a package, what is the
probability that, at most, five defective parts may be found in a package?

Solution. The “average” number of defective parts in a package is 300 × 0.03 = 9.
Assume a Poisson distribution with this average, i.e.,

$$\lambda = np = 9$$

According to Eq. (2-143), the probability of a box containing x defective parts is

$$f(x) = \frac{e^{-9}\,9^x}{x!}$$

$$F(x) = P\{X \le x\} = \sum_{k=0}^{[x]} \frac{e^{-9}\,9^k}{k!}$$

$$F(5) = e^{-9}\left(1 + 9 + \frac{9^2}{2!} + \frac{9^3}{3!} + \frac{9^4}{4!} + \frac{9^5}{5!}\right) \approx 0.116$$


Example 2-34. An industrial process has been running in control with 0.5 per cent
defectives. Find the smallest integer k such that the probability of getting k or more
defectives in a random sample of 100 is less than 0.10.





Solution. Assuming a Poisson distribution with p = 0.005 and n = 100, one finds
$\lambda = np = 0.5$. Thus it is reasonable to use a Poisson distribution. In this case,

$$P\{X \ge k\} < 0.10$$

$$P\{X \le k - 1\} > 0.90$$

$$\sum_{j=1}^{k} \frac{e^{-\lambda}\lambda^{j-1}}{(j-1)!} = \sum_{j=1}^{k} \frac{e^{-0.5}\,0.5^{\,j-1}}{(j-1)!} > 0.90$$

From a Poisson distribution table one finds that

$$k - 1 = 1 \qquad k = 2$$
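
In place of a Poisson distribution table, the search for the smallest k can be carried out directly. This is a sketch under the same Poisson assumption as the solution above; the function names `poisson_cdf` and `smallest_k` are hypothetical.

```python
from math import exp, factorial

def poisson_cdf(m, lam):
    """P{X <= m} for a Poisson variable; zero when m is negative."""
    return sum(exp(-lam) * lam ** j / factorial(j) for j in range(m + 1))

def smallest_k(lam, alpha):
    """Smallest integer k with P{X >= k} = 1 - P{X <= k-1} < alpha."""
    k = 0
    while 1 - poisson_cdf(k - 1, lam) >= alpha:
        k += 1
    return k

print(smallest_k(0.5, 0.10))     # Example 2-34: lam = 0.5
print(poisson_cdf(5, 9.0))       # Example 2-33: F(5) with lam = 9
```

The first call reproduces k = 2, and the second evaluates the sum F(5) of Example 2-33 for an average of 9 defective parts.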

2-20. Expected Value of a Random Variable. Consider a discrete
single-variate random variable X and its associated probability function:

$$[x_1, x_2, \ldots, x_n]$$

$$[p_1, p_2, \ldots, p_n]$$

If the random experiment under consideration is repeated a large number
of times, the average or mean value of the numerical function X is found
to be

$$\text{Average of } X = \bar{X} = \sum_{k=1}^{n} p_k x_k \tag{2-144}$$

For example, for the experiment of rolling an honest die, one finds

$$\bar{X} = \tfrac16(1 + 2 + 3 + 4 + 5 + 6) = 3.5$$

More generally, if $\psi(X)$ is a function of a random variable X (also called
a weighting function), the mean value of $\psi(X)$ is defined as

$$\text{Mean of } \psi(X) = \overline{\psi(X)} = \sum_{k=1}^{n} p_k\,\psi(x_k) \tag{2-145}$$

In the literature of probability, the mean of a function is generally
referred to as its expected value. An alternative notation for denoting the
mean value of a random quantity is a capital E in front of that quantity,
for instance, E(X) or E(X + Y) or $E(2X + X^2)$. When the function
$\psi(X)$ is of the form $\psi(X) = X^j$ where j is a positive integer, its expected




value is called the moment of the jth order of X. For example,

$$E(X) = \bar{X} = \text{first-order moment of } X = \sum_{k=1}^{n} p_k x_k$$

$$E(X^2) = \overline{X^2} = \text{second-order moment of } X = \sum_{k=1}^{n} p_k x_k^2 \tag{2-146}$$

$$E(X^3) = \overline{X^3} = \text{third-order moment of } X = \sum_{k=1}^{n} p_k x_k^3$$


The physical significance of moments is not discussed here. At present
the reader is required only to acquaint himself with the concept of Eq.
(2-145), that is, how the means of different weighting functions can be
calculated. The concept of averaging is of considerable importance in
engineering problems. For example, assume that X is a random voltage
applied as the input to a device with an input-output relationship

$$Y = \psi(X)$$

Then E(Y) is the d-c level for the output of the system. Similarly, if
Y is applied across a unit resistor, the power consumed in the resistor,
measured with respect to its d-c level, will have the same numerical value
as the second moment of the random variable $(Y - \bar{Y})$, that is, the
expectation of

$$\{\psi(X) - E[\psi(X)]\}^2 \tag{2-147}$$

There are at least three special weighting functions of particular interest 
in probability and information theory. These are 

$X^j$     j = 1, 2, 3, . . .

e = base of natural logarithm 

logX 

Without discussing the details at this time, we merely point out the most 
important application feature of each of the above functions : 

$E(X^j)$ This gives moments of different orders of X.

E{e^) When this mean is known, one can find the values of dif- 

ferent moments without recourse to direct computation. 
$E(-\log X)$ In the following chapter it will be shown that, when X is
taken to be the probability function f(x), the new random
variable $[-\log f(x)]$ presents the amount of uncertainty
associated with the occurrence of each outcome of the
discrete experiment. Therefore, its mean value will
stand for the average uncertainty of the system under
consideration.
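
The averaging of different weighting functions can be sketched for the honest die as follows. The helper name `expect` is an assumption of this sketch, and so is the choice of logarithms to the base 2 in the last line, since the base is not fixed at this point of the text.

```python
from fractions import Fraction
from math import log2

def expect(values, probs, psi=lambda x: x):
    """Mean of psi(X): sum of p_k * psi(x_k) over all values."""
    return sum(p * psi(x) for x, p in zip(values, probs))

faces = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6

mean = expect(faces, probs)                                # first moment
second = expect(faces, probs, lambda x: x ** 2)            # second moment
spread = expect(faces, probs, lambda x: (x - mean) ** 2)   # moment about mean

# Mean of -log f(X), with logarithms to base 2 (an assumed base).
avg_uncertainty = sum(float(p) * -log2(float(p)) for p in probs)

print(mean, second, spread, avg_uncertainty)
```

The third quantity is the second moment of $(X - \bar{X})$, the same form as the expression for the power measured with respect to the d-c level.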





The concept of averaging can be generalized in a direct manner to
weighting functions of n random variables associated with an n variate.
For example, in the case of a bivariate random variable [X,Y] and a
weighting function $\psi(X,Y)$ we have

$$E[\psi(X,Y)] = \sum_{j}\sum_{k} f(x_j,y_k)\,\psi(x_j,y_k) \tag{2-148}$$

PROBLEMS 

2-1. Determine whether or not the following relations are correct (the primes
denote the complements):

(a) (A + B)(A + C) = A + BC

(b) (A + B) - B = AB'

(c) A'B = A + B

(d) (A - AB)C' = A(B + C)'

(e) (A + B)'C = A'B'C

(f) (A + B)(B + C)(C + A) = AB + AC + BC

(g) $(A \cap B) \cap (B' \cap C) = 0$


2-2. Let A, B, C be three arbitrary events of a sample space. Find the expressions
for the following cases: 

(a) At least one of the three events occurs. 

(b) B occurs and either A or C occurs, but not both. 

(c) Not more than two occur simultaneously. 

2-3. Consider the set of points $S = \{(x,y)\}$ shown in Fig. P2-3.



(a) Find the subset a = {{x,y)\x^ + 2 /^ < 41. 

(b) Describe the subset b = \(Xyy)\y <x^\. 

(c) Describe the subset c = \(x,y)\x < y^\. 

(d) Describe the subset b c. 

(c) Describe the subset (6 U a)c'. 

2-4. Given a set S = {0,1,2,3,4,5,6,7,8,9,10},

(a) Define the function $F_1(x) = x/2$ over S and draw its graph.

(b) Define the function $F_2(x) = x + 3$ over S and draw its graph.

(c) Determine the subset $a = \{x \mid (x/2)(x + 3) < 4\}$.




2-5. Show the following identities and draw the corresponding Sheffer-stroke
diagrams.

(a) $(X|(Y|Y)) = X' \cup Y$.

(b) $(X|(X|X)) = U$.

(c) Verify the identity of the expression for the output as given in Fig. P2-5a and b.

Fig. P2-5

2-6. If $A_1, A_2, \ldots, A_n$ are independent events, show that

$$P\{A_1 + A_2 + \cdots + A_n\} = 1 - P\{A_1'\}P\{A_2'\}\cdots P\{A_n'\}$$

2-7. Two cards are drawn at random successively, the first being replaced before 
the second is drawn. What is the probability of the first being a club and the second 
not a queen? 

2-8. Two dice are thrown. Denote by A the event that the sum of the faces is
even and by B the event that their difference is even. Describe the events A + B,
AB, A'B', AB', and A' + B and find their probabilities.

2-9. Given five letters a, 6, c, d, e, in how many different ways can one write three- 
letter words without repeating any letter (a) irrespective of their order and (b) con- 
sidering the order of letters? 

2-10. In how many different ways can a committee of four men and two women be 
selected from a total of 20 men and 10 women? ^ 

2-11. A survey of 1,000 people has indicated the following results: 714 listen to
radio station A, 640 to station B, and 850 to station C. It also indicated that 530
listen to both A and B, 375 to both C and B, and 720 to A and C. Determine whether
these data are not self-contradictory.

2-12. What is the probability of obtaining 8, 9, or 10 with two dice in one trial ? 

2-13. Two dice are thrown. Let A be the event that the sum of the faces is odd
and B the event that at least one is a 1. Describe the events AB, A + B, AB' and
find their probabilities.

2-14. What is the probability of drawing a club or a face card of any color in a 
single draw from an ordinary deck of cards? 





2-15. Two events A and B associated with an experiment have respective probabilities
of occurrence p and q. Show that in n trials the probability that AB occurs
$K_1$ times; AB', $K_2$ times; A'B, $K_3$ times; and A'B', $K_4$ times is

2-16. Urn A contains seven silver dollars and one $10 gold coin. Urn B contains 
10 silver dollars. Nine coins are taken from B and put in A; then eight coins are 
selected at random from the 17 coins in A and put back in urn B. If you were to 
select one of the two urns, which one should you select? 

2-17. If the probability of a safe return from a certain trip is P = 0.9, what is the 
probability of exactly four safe returns out of six such trips? 

2-18. A single card is removed from a regular deck of cards. From the remainder,
we draw two cards and observe that they are both diamonds. What is the probability 
that the removed card was also a diamond? 

2-19. Show that the two relay circuits of Fig. P2-19 are equivalent. 



Fig. P2-19


2-20. Express the event of the functioning of the network in Fig. P2-20 in terms of 
the subevents Ei, Ez, , E%, where Ek implies the functioning of the fcth relay. 


2 3 



2-21. Two persons toss a coin n times each. What is the probability that they 
score the same number of heads? 

2-22. If a box contains 40 good and 10 defective objects, what is the probability 
that 10 objects selected at random from the box are all good? 

2-23. What is the probability that in a bridge hand a player and his partner have
a total of three aces?





2-24. Assuming that the ratio of male to female children is 1:2, find the probability
that in a family of six children

(a) All children will be of the same sex.

(b) The four oldest children will be boys and the two youngest will be girls.

(c) Exactly half the children will be boys.

2-25. In a game of bridge, if a player has no ace, what is the probability that his
partner has no ace either?

2-26. Find the probability that three, and only three, tails are obtained in a
sequence of four tosses of a coin.

2 - 27 . Assuming that the probability of each relay being closed is p, derive the 
probability for the flow of a current between nodes and /? of Fig. P2-27. 



Fig. P2-27


2-28. Same question as in the preceding problem for the networks of Fig. P2-28.

Fig. P2-28


2-29. The following joint probability matrix is given for discrete random variables
X and Y. Evaluate the marginal and the conditional probability functions.

'yi2 H o~ 

0 H H 

2-30. The joint density function for two random variables X and Y is given below:

fiXfV) = for a; > 0; y >0 

f(x:,y) = 0 elsewhere 

Find P{X<1, y<l) 

2-31. Evaluate the probability of getting a four 0, 1, 2, 3, 4, and 5 consecutive times
on the throw of a die.





2-32. If the probability of hitting a target is in each shot, independent of the
number of shots fired,

(a) What is the probability of the target being hit twice in five shots?

(b) What is the probability of the target being hit at least twice in five shots?

2-33. A book of 200 pages contains 100 misprints. Assuming that these are distributed
at random, estimate the chances that a page contains at least two misprints.

2-34. The random variable X assumes the values [0, 1, 2] with respective probabilities
The random variable Y assumes the values [0,1] with probabilities
Assuming that the two variables are independent, determine their joint
probability functions.

2-35. Study the different probability functions (joint, marginal, and conditional)
associated with the following experiment. We draw five cards from an ordinary deck
of cards and study the two random variables below:

X, number of aces drawn
Y, number of queens drawn

2-36. A random event E has the probability of occurrence 1/K in each experiment
independently of the preceding outcome. Determine the following probabilities:

(a) E does not occur in n consecutive trials.

(b) E occurs in the nth experiment only but not in any of the previous ones.

(c) E occurs exactly twice in n experiments.

(d) Let K = 4, n = 4 and evaluate the results of parts (a), (b), and (c).

2-37. Smith-Jones-Robinson Problem. The following problem has appeared in the
Scientific American (vol. 200, no. 2, p. LIO, February, 1959) in an entertaining article
entitled “Brain-teasers” That Involve Formal Logic.



1. Smith, Jones, and Robinson are the engineer, brakeman, and fireman on a train, 
but not necessarily in that order. Riding the train are three passengers with the same 
three surnames, to be identified in the following premises by “Mr.” before their names.

2. Mr. Robinson lives in Los Angeles. 

3. The brakeman lives in Omaha. 

4. Mr. Jones long ago forgot all the algebra he learned in high school. 

5. The passenger whose name is the same as the brakeman’s lives in Chicago. 

6. The brakeman and one of the passengers, a distinguished mathematical physicist, 
attend the same church. 

7. Smith beat the fireman at billiards. 

Who is the engineer? 






Hint: The solution by methods of set theory may become somewhat cumbersome.
It is suggested in the above reference to use two matrices as notational aid. Each cell
is the intersection of two sets, corresponding to the set of elements contained in the
pertinent column and row. Put a 1 or a 0 in a cell indicating that such an intersection
is a valid premise or not.

2-38. Eddington's Controversy. The following problem exemplifies the type of confusion
that existed in probability prior to the introduction of set-theory considerations.

If A, B, C, D each speak the truth once in three times (independently), and A
affirms that B denies that C declares that D is a liar, what is the probability that D
was speaking the truth?

The following comments on Eddington's problem are given in an article entitled
“Brain-Teasers” That Involve Formal Logic by M. Gardner (op. cit.).

“Eddington's answer of was greeted by howls of protest from his readers,
touching off an amusing controversy that was never decisively resolved. The English
astronomer Herbert Dingle, reviewing Eddington's book in Nature (Mar. 23, 1935),
dismissed the problem as meaningless and symptomatic of Eddington's confused
thinking about probability. Theodore Sterne, an American physicist, replied (Nature,
June 29, 1935) that the problem was not meaningless but lacked sufficient data for a
solution. Dingle responded (Nature, Sept. 14, 1935) by contending that, if one
granted Sterne's approach, there were enough data to reach a solution of exactly
Eddington then reentered the fray with a paper entitled The Problem of A, B, C
and D (Math. Gaz., October, 1935), in which he explained in detail how he had calculated
his answer.”

The difficulty lies chiefly in deciding exactly how to interpret Eddington's state- 
ment of the problem. If B is truthful in making his denial, are we justified in assum- 
ing that C said that D spoke the truth? Eddington thought not. Similarly, if A is 
lying, can we then be sure that B and C said anything at all? Fortunately we can 
side-step all these verbal difficulties by making (as Eddington did not) the following 
assumptions: (1) All four men made statements. (2) A, B, and C each made a
statement that either affirmed or denied the statement that follows. (3) A lying
affirmation is taken to be a denial, and a lying denial is taken to be an affirmation. 

2-39.* If a stick is broken at random into three pieces, what is the probability that
the pieces can be put together in a triangle?

Hint: The problem, despite its apparently clear statement, is ambiguous. It 
requires some additional information about the exact method of breaking the stick. 
The following two explanations are given in the cited reference. 

“One method is to select, independently and at random, two points from the points
that range uniformly along the stick, then break the stick at these two points. If
this is the procedure to be followed, the answer is 1/4, and there is an elegant way of
demonstrating it with a geometrical diagram. . . .

“Suppose, however, that we interpret in a different way the statement ‘break a stick 
at random into three pieces.’ We break the stick at random, we select randomly one 
of the two pieces, and we break that piece at random. What are the chances that 
the three pieces will form a triangle? If after the first break we choose the smaller 
piece, no triangle is possible.” 

The latter interpretation of the problem gives H for the required probability. 


* This problem and its solution have appeared in the article Problems Involving
Questions of Probability and Ambiguity by M. Gardner (Sci. American, October,
1959, pp. 174-176). Gardner’s article contains several other examples of ambiguity
which have puzzled even some well-known mathematicians.




2-40. The joint probability matrix of two variables is given below. Determine
whether they are statistically independent.

3rK2 lU ^ 2 “ 

2 Hr 

1 LH2 H Ha. 

y/x I 2 3 

2-41. Two urns contain four white and three black balls and three white and seven 
black balls, respectively. One urn is selected at random and a ball is drawn from it. 
What is the probability that this ball is white? 

2-42. A Markov chain has the transition probability matrix given below: 

‘0 K H “ 

0 H 2 Hi 
10 0 _ 

The three states are initially selected with probabilities li, H- 

(a) What is the probability of reaching state 2 via state 1 in one step?

(b) What is the probability of reaching state 2 via 1 in two steps?

(c) What is the probability of reaching state 3 in two steps?

2-43. Define the probability function for the number of boys in a family of six
children, assuming that both sexes are equiprobable and no multiple birth occurs.

2-44. From the joint probability matrix below,



0 



2/2 

H2 

% 

4h 

Vl 

.3^6 

'4 

0 


J-1 

Xs 

X. 


compute and tabulate : 

(a) Marginal probability P_1{x_k}.

(b) Marginal probability P_2{y_j}.

(c) P{y_j|x_k}.

(d) P{x_k|y_j}.



CHAPTER 3 


BASIC CONCEPTS OF INFORMATION THEORY: 
MEMORYLESS FINITE SCHEMES 

The object of this chapter is to present the basic elements of informa- 
tion theory of discrete schemes in a manner parallel to the presentation 
of the elements of discrete probability theory. Our immediate aim is to 
develop a measure for information content of a discrete system. That 
measure will then be used for evaluating the rate of transmission of 
information in a communication system. No effort will be made to 
expound on the philosophical context of terms such as “information 
measure” or “communication.” In order to grasp a basic understand- 
ing of this newly developed scientific field, it seems desirable to confine 
ourselves to an accurate abstract mathematical model rather than to 
deal with generalities of a semiphilosophical nature. The following
approach is suggested : 

We shall consider a discrete random experiment and its associated
sample space Ω. Let X be a random variable (a real numerical function)
associated with Ω; we know that, say, E(X) has a particular physical
meaning in regard to the random experiment. That is, if the experiment
is repeated a large number of times, the values of X when averaged will
approach E(X). In summary, E(X) has given a certain “physical”
indication about the experiment. Similarly, has a certain
significance in our studies. Then the question arises, could we search for an
indicative number associated with the random experiment such that it
provides a “measure” of surprise or unexpectedness of occurrence of
outcomes of the experiment? Shannon has suggested that the random
variable -log P{E_k} is an indicative relative measure of the occurrence
of the event E_k. In particular, he shows that the mean of this function is
a good indication of the average uncertainty with respect to all the
outcomes of the experiment.

The reader should note that the above terms in quotation marks are used
here with their common meaning. Their more accurate technical meaning
will be defined later.

3-1. A Measure of Uncertainty. Consider the sample space Ω of
events pertaining to a random experiment. We partition the sample
space into a finite number of mutually exclusive events E_k, whose
probabilities p_k are assumed to be known (Fig. 3-1). The set of all events
under consideration can be designated as a row matrix [E] and the
corresponding probabilities as another row matrix [P].

    [E] = [E_1, E_2, \ldots, E_n]   with   \bigcup_{k=1}^{n} E_k = U        (3-1)

    [P] = [p_1, p_2, \ldots, p_n]   with   \sum_{k=1}^{n} p_k = 1        (3-2)

Equations (3-1) and (3-2) contain all the information that we
have about the probability space, which is called a complete finite scheme.
For example, the following matrix represents such a situation:

    \begin{bmatrix} E \\ P \end{bmatrix} = \begin{bmatrix} E_1 & E_2 & E_3 \\ 0.2 & 0.5 & 0.3 \end{bmatrix}

The fundamental problem of interest is to associate a measure of surprise
or uncertainty, H(p_1, p_2, ..., p_n), with such probability schemes.
Of course at this point it is questionable what is meant by a measure of 
uncertainty. The clarification of this concept has to come gradually; 
it is, in essence, the central theme of information theory. The problem 
can be approached in either of two, not necessarily exclusive, ways : 

1. First postulate the desired properties of such an uncertainty measure;
then derive the functional form of H(p_1, p_2, ..., p_n). The postulation
of the desired properties can be based on some intuitive approach,
such as physical motivation or “usefulness” for some purpose, but after
such a postulate is adopted, mathematical discipline must prevail and no
further intuitive approach may be employed.

2. Assume a known functional H(p_1, p_2, ..., p_n) associated with a
finite probability scheme and justify its “usefulness” for the physical
problems under consideration.

Our present approach is primarily of type 2. The more mathe- 
matically inclined readers who prefer an axiomatic approach are referred 
to Sec. 3-19 or Feinstein (I). 

Shannon and Wiener have suggested the following measure of uncer- 
tainty or entropy associated with the sample space of a complete finite 
scheme. 

    H(X) = -\sum_{k=1}^{n} p_k \log p_k        (3-3)*

* All logarithms are to base 2 unless otherwise specified.

where p_k is the probability of the occurrence of the event E_k as described
in Eqs. (3-1) and (3-2). The base of the logarithm is rather arbitrary;
however, for communication problems it is convenient to use the binary
base.
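
As a minimal sketch of how the measure of Eq. (3-3) is evaluated, the scheme with probabilities [0.2, 0.5, 0.3] given above can be worked numerically. The function name `entropy` and the convention that terms with zero probability contribute nothing are assumptions of this illustration, not part of the text.

```python
import math

def entropy(probs, base=2.0):
    """Entropy of a complete finite scheme, Eq. (3-3), in units set by `base`."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1, Eq. (3-2)")
    # Terms with p = 0 are skipped (assumed convention: 0 log 0 = 0).
    return -sum(p * math.log(p, base) for p in probs if p > 0)

h = entropy([0.2, 0.5, 0.3])
print(round(h, 4))   # about 1.4855 bits
```

The result lies below log 3 (about 1.585 bits), the value the same function returns for three equally likely events.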

Our immediate plan in this chapter is first to investigate the principal 
properties of this suggested measure of uncertainty and to justify its 
“usefulness” with respect to statistical problems of communication
systems. Next we shall generalize this concept to two-dimensional 
probability schemes, which provide simple models for communications. 
Finally the discussion will be directed toward more general n-dimensional 
probability schemes. We shall always be restricted to complete systems 
of events; that is, we assume that Eqs. (3-1) and (3-2) are satisfied. 

3-2. An Intuitive Justification. In this section we wish to justify the 
usefulness of the function suggested in Eq. (3-3) in connection with com- 
munication problems. In problems dealing with communication systems, 
it is often instructive to regard a finite exhaustive probability scheme as 
a mathematical model for a communication source. In this analogy, any 
elementary event or outcome, E_k, may be considered as a letter of the
alphabet of the communication transmitter.

Now consider the random variable

    X = -\log p        (3-4)

defined over the sample space of Fig. 3-1. To each event E_k there
corresponds a value x_k of the random variable X, where by hypothesis

    x_k = -\log P\{E_k\} = -\log p_k        (3-5)

The quantity -log p_k is frequently called the amount of self-information
associated with the event E_k:

    I(E_k) = -\log p_k        (3-6)

The unit of the amount of information is called a bit, where one bit is the
amount of information associated with the selection of one of two
equiprobable (p_k = 1/2) events. In other words, if the sample space is
partitioned into two equally likely events E_1 and E_2, then

    I(E_1) = I(E_2) = -\log \frac{1}{2} = 1 bit        (3-7)

A selection between two equally likely events requires one unit of
information. If Ω were partitioned into 2^N equally probable events
E_k (k = 1, 2, ..., 2^N), then the self-information associated with any
event E_k would be

    I(E_k) = -\log p_k = -\log 2^{-N} = N bits        (3-8)




The generalization from equiprobable events to the general case is
straightforward. In fact, in order to evaluate the self-information
associated with a particular event E_0, we divide the Ω space into two parts,
E_0 and its complement E_0'; thus

    I(E_0) = -\log P(E_0) = -\log p_0 bits        (3-9)

For instance, if p_0 = 1/16, the occurrence of E_0 on the average conveys
to us 4 bits of information. The measure of self-information is
essentially nonnegative:

    I(E_k) = -\log p_k \ge 0        (3-10)

Equality holds only for the certain event; obviously, no information is
conveyed by the knowledge of the occurrence of such an event.
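
The cases quoted in Eqs. (3-7) to (3-10) can be checked with a short sketch. The helper name `self_information` is hypothetical and serves only this illustration.

```python
import math

def self_information(p):
    """Self-information -log2(p), in bits, of an event with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must satisfy 0 < p <= 1")
    return math.log2(1 / p)   # same quantity as -log2(p)

print(self_information(1 / 2))     # one of two equiprobable events
print(self_information(1 / 16))    # the p_0 = 1/16 case
print(self_information(2 ** -10))  # one of 2^N equiprobable events, N = 10
print(self_information(1.0))       # the certain event
```

The four calls give 1, 4, 10, and 0 bits respectively, in line with Eqs. (3-7), (3-9), (3-8), and the equality case of (3-10).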

The average amount of information or entropy of a finite complete 
probability scheme is defined by 


    H(X) = \overline{I(E_k)} = -\sum_{k=1}^{n} p_k \log p_k        (3-11)

where the random variable X is defined over the sample space of events
Ω and the events satisfy Eqs. (3-1) and (3-2). H(X) is the average
amount of self-information per event, the average being taken over the
entire sample space. In fact, if -log p_k indicates the measure of
uncertainty associated with the event E_k, then H(X) will clearly represent the
mean or the expected value of the uncertainty associated with our
probability scheme. As a simple example, let us consider the following three
sets of complete events and compare their entropies.


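    (I)    E = [A_1, A_2]    P = [1/256, 255/256]

    (II)   E = [B_1, B_2]    P = [1/2, 1/2]

    (III)  E = [C_1, C_2]    P = [7/16, 9/16]

The average self-information associated with each of these schemes is
given respectively by

    (I)    I_1 = -\left( \frac{1}{256} \log \frac{1}{256} + \frac{255}{256} \log \frac{255}{256} \right) = 0.0369 bit

    (II)   I_2 = -\left( \frac{1}{2} \log \frac{1}{2} + \frac{1}{2} \log \frac{1}{2} \right) = 1 bit

    (III)  I_3 = -\left( \frac{7}{16} \log \frac{7}{16} + \frac{9}{16} \log \frac{9}{16} \right) = 0.989 bit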

In system I it is relatively easy to guess whether A_1 or A_2 will occur.
In system III this guess is much harder, and in II it is most difficult to
predict the occurrence of one of the events B_1 or B_2. It is common sense
to attribute a larger average uncertainty to system II than to system III
and a larger average uncertainty to system III than to system I. This is
in agreement with the results obtained by application of the chosen
self-information function, that is,

    I_1 < I_3 < I_2

The average uncertainty associated with II is far more than that
associated with I. For I, we are almost sure that A_2 generally occurs. For
II, the average uncertainty is larger, as it is most difficult to say whether
B_1 or B_2 occurs.
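
The three values can be reproduced with a few lines. This is only a sketch; the probabilities 1/256, 1/2, and 7/16 are taken as the first entries of schemes I, II, and III, and the dictionary layout is an assumption of the illustration.

```python
import math

def entropy(probs):
    """Average self-information in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

schemes = {
    "I": [1 / 256, 255 / 256],
    "II": [1 / 2, 1 / 2],
    "III": [7 / 16, 9 / 16],
}
H = {name: entropy(p) for name, p in schemes.items()}

for name, value in H.items():
    print(f"{name}: {value:.4f} bit")
print(H["I"] < H["III"] < H["II"])
```

The ordering printed on the last line is the one argued for on intuitive grounds: the nearly certain scheme I carries the least average uncertainty and the equiprobable scheme II the most.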

3-3. Formal Requirements for the Average Uncertainty. Shannon’s 
approach, as well as several other authors’, in suggesting a suitable H 
function has been to some extent directed toward an axiomatic descrip- 
tion of such functions. The desired H function should have the following 
basic properties:

1. Continuity. That is, if the probabilities of the occurrence of events 
Ek are slightly changed, the measure of uncertainty associated with the 
system should vary accordingly in a continuous manner. 

    H(p_1, p_2, \ldots, p_n) continuous in p_k     k = 1, 2, \ldots, n     0 \le p_k \le 1        (3-12)

This requirement is obviously in conformity with our physical senses, 
since a slight change in the probability of the occurrence of an event 
should not provide us with a significantly large amount of information. 

2. Symmetry. The H function must be functionally symmetric in
every p_k. Indeed, the measure of uncertainty associated with a complete
probability set [E, E'] must be exactly the same as the measure
associated with the set [E', E]. Our measure must be invariant with respect
to the order of these events.

    H(p_1, p_2, \ldots, p_n) = H(p_2, p_1, \ldots, p_n)        (3-13)

3. Extremal Property. When all the events are equally likely, the
average uncertainty must have its largest value. In this case, it is most
uncertain which event is going to occur. Conversely, once we know
which specific event among a number of n equally likely events has occurred,
we have acquired the largest average amount of information relevant to
the occurrence of events of a universe consisting of n complete events.

    Maximum of H(p_1, p_2, \ldots, p_n) = H\left( \frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n} \right)        (3-14)

4. Additivity. Suppose that we have obtained a suitable measure of
the average uncertainty H(p_1, p_2, ..., p_n) associated with a complete
set of events. Now, let us assume that the event E_n is divided into
disjoint subsets (Fig. 3-2) such that

    E_n = \bigcup_{k=1}^{m} F_k        (3-15)

    p_n = \sum_{k=1}^{m} q_k     P\{F_k\} = q_k        (3-16)

Fig. 3-2. A partitioning of the probability space illustrating the additive
property of the information measure.

Evidently, the occurrence of the event E_n can be considered as
another total sample space where the probabilities associated with events
F_k can be normalized in the form

    \frac{q_1}{p_n} + \frac{q_2}{p_n} + \cdots + \frac{q_m}{p_n} = 1        (3-17)

[This recourse provides a rather convenient relative frame of reference.
That is, we regard the event E_n as a sample space in its own right,
associated with the experiments of obtaining all events F_k (k = 1, 2, ..., m),
when we know that E_n is bound to occur.] Therefore we have three
probability spaces and hence the following three H functions:

    H_1 = H(p_1, p_2, \ldots, p_n)
    H_2 = H(p_1, p_2, \ldots, p_{n-1}, q_1, q_2, \ldots, q_m)
    H_3 = H\left( \frac{q_1}{p_n}, \frac{q_2}{p_n}, \ldots, \frac{q_m}{p_n} \right)        (3-18)

A suitable additive or linear measure which also satisfies our common
sense is given by

    H_2 = H_1 + p_n H_3        (3-19)


The occurrence of the weighting factor pn in this linear form is rather 
anticipated. However, the uninitiated reader will find the examples of 
the following section helpful in illustrating this point. 

Complying with properties 1 to 4 given above, or with similar require- 
ments, one should be able to derive a functional form for the desired 
uncertainty function. Such treatments have appeared in the work of 
Feinstein, Khinchin, Shannon, Schutzenberger, and others. Their 
findings are not too complicated, but for a detailed presentation much 
more space is required than is available in the present work. The 
following references to the literature are recommended for further 
reading. 

1. Fadiev assumes properties 1, 2, and 4 and, subsequent to several 
lemmas, proves that H must be of the form suggested in Eq. (3-11) except 
for a multiplicative constant. (See Feinstein [I].) 






2. Khinchin assumes properties 1, 3, and 4 and the fact that adding a
null set to a complete set of events should not change its entropy, and he
derives the form of Eq. (3-11) up to a positive constant multiplier.

3. Schutzenberger [I] aims for a more general axiomatic search for a 
measure of information associated with a complete set of events. He 
shows that functions other than the Shannon-Wiener entropy of Eq. 
(3-11) may also be employed. An example of such a function is given in 
the work of R. A. Fisher.* It should be pointed out, however, that the 
Shannon-Wiener suggested form is certainly the simplest of all such 
forms. The present richness and depth of the literature of information 
theory are to a great extent due to the simplicity of the form of Eq. 
(3-11). 

3-4. H Function as a Measure of Uncertainty. In this section we shall
present a treatment concerning the suggested measure of uncertainty. 
We have discussed that such a measure should obey the following 
requirements : 


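    H(p_1, p_2, \ldots, p_n) continuous in p_k for all 0 \le p_k \le 1        (3-20)

    H(p_k, 1 - p_k) = H(1 - p_k, p_k)     k = 1, 2, \ldots, n        (3-21)

    maximum of H(p_1, p_2, \ldots, p_n) = H\left( \frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n} \right)        (3-22)

    H(p_1, p_2, \ldots, p_{n-1}, q_1, q_2, \ldots, q_m) = H(p_1, p_2, \ldots, p_{n-1}, p_n)
        + p_n H\left( \frac{q_1}{p_n}, \frac{q_2}{p_n}, \ldots, \frac{q_m}{p_n} \right)        (3-23)

where

    p_n = \sum_{k=1}^{m} q_k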


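* According to R. A. Fisher (Proc. Cambridge Phil. Soc., vol. 22, pp. 700-725, 1925),
the quantity of “information” in a sample from a distribution with density f(x) and
mean m is defined as

    I = \int_{-\infty}^{\infty} \left( \frac{\partial \ln f}{\partial m} \right)^2 f(x) \, dx

For example, for a normal distribution

    f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{(x - m)^2}{2\sigma^2} \right]

    \frac{\partial \ln f}{\partial m} = \frac{x - m}{\sigma^2}

    I = \frac{1}{\sigma^2} = Fisher's “information” per observation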




In the following, we demonstrate that the function defined in Eq. 
(3-11) satisfies all these requirements. 

Property 1: Continuity. The entropy function H(p_1, p_2, ..., p_n) is
continuous in each and every independent variable p_k in the interval
[0, 1]. The proof follows directly.

    -H(p_1, p_2, \ldots, p_n) = p_1 \log p_1 + p_2 \log p_2 + \cdots + p_n \log p_n
        = p_1 \log p_1 + p_2 \log p_2 + \cdots + p_{n-1} \log p_{n-1}
          + (1 - p_1 - p_2 - \cdots - p_{n-1}) \log (1 - p_1 - p_2 - \cdots - p_{n-1})        (3-24)

Note that all independent variables p_1, p_2, ..., p_{n-1} and also
(1 - p_1 - p_2 - ... - p_{n-1}) are continuous in [0, 1] and that the
logarithm of a continuous function is continuous itself.

Property 2: Symmetry. The entropy function is, obviously, a sym- 
metrical function in all variables. 

Property 3: Extremal Value of the Entropy Function. We should like
to show that the entropy function has a maximum when all the individual
probabilities are equal.

    p_1 = p_2 = \cdots = p_n        (3-25)

This is in conformity with our intuitive feelings; i.e., in a system where 
all different states are equiprobable, our average uncertainty will be 
greatest (in other words, it is most difficult to predict which state is most 
likely to occur). 

We may arbitrarily select p_n as a dependent variable depending on
p_k (k = 1, 2, ..., n - 1). In fact,

    \frac{dH}{dp_k} = \sum_{i=1}^{n} \frac{\partial H}{\partial p_i} \frac{\partial p_i}{\partial p_k}        (3-26)

But

    p_n = 1 - (p_1 + p_2 + \cdots + p_k + \cdots + p_{n-1})        (3-27)

Hence

    \frac{dH}{dp_k} = -(\log e + \log p_k) + (\log e + \log p_n)        (3-28)

    \frac{dH}{dp_k} = -\log \frac{p_k}{p_n}        (3-29)

Setting dH/dp_k = 0 yields

    p_k = p_n        (3-30)

Since p_k was chosen arbitrarily, we come to the conclusion that, for an
extremal point of the H function, we must have

    p_1 = p_2 = \cdots = p_n = \frac{1}{n}        (3-31)

It remains to be shown that the latter relation makes the H function a
maximum and not a minimum. For this we note that

    H(1, 0, 0, \ldots, 0) = 0        (3-32)

But

    H\left( \frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n} \right) = \log n > 0        (3-33)

Thus when all the mutually exclusive events are equiprobable, the H
function reaches its maximum value.
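
The stationarity argument can be checked numerically. The sketch below is an illustration under stated assumptions: it perturbs one probability p_k while letting p_n absorb the change, so that the probabilities still sum to 1, and compares a central-difference slope with log p_n - log p_k. The helper `dH_dpk` and the step size are choices of this example only.

```python
import math

def entropy(probs):
    """Entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def dH_dpk(probs, k, h=1e-6):
    """Central difference of H in p_k, with the last probability p_n adjusting."""
    up, down = list(probs), list(probs)
    up[k] += h
    up[-1] -= h
    down[k] -= h
    down[-1] += h
    return (entropy(up) - entropy(down)) / (2 * h)

p = [0.1, 0.2, 0.3, 0.4]
numeric = dH_dpk(p, 0)
analytic = math.log2(p[-1] / p[0])     # log p_n - log p_k

uniform = [0.25] * 4
slope_at_uniform = dH_dpk(uniform, 0)  # vanishes when p_k = p_n

print(numeric, analytic, slope_at_uniform, entropy(uniform))
```

At the equiprobable point the slope vanishes and the entropy equals log 4 = 2 bits, while away from it the slope is nonzero and has the sign of log(p_n/p_k).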

Property 4: Additivity. We prove the validity of this property by 
reducing the left member to a form identical with the right member of 
Eq. (3-23) : 

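    H(p_1, p_2, \ldots, p_{n-1}, q_1, q_2, \ldots, q_m)

        = -\sum_{k=1}^{n-1} p_k \log p_k - \sum_{k=1}^{m} q_k \log q_k

        = -\sum_{k=1}^{n} p_k \log p_k + p_n \log p_n - \sum_{k=1}^{m} q_k \log q_k

        = H(p_1, p_2, \ldots, p_n) + p_n \log p_n - \sum_{k=1}^{m} q_k \log q_k        (3-34)

But

    p_n \log p_n - \sum_{k=1}^{m} q_k \log q_k
        = -p_n \sum_{k=1}^{m} \frac{q_k}{p_n} \log \frac{q_k}{p_n}
        = p_n H\left( \frac{q_1}{p_n}, \frac{q_2}{p_n}, \ldots, \frac{q_m}{p_n} \right)        (3-35)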

This proves the identity of the two sides of Eq. (3-23). 

It is to be noted that, since H functions are essentially nonnegative, 
we have 


    H(p_1, p_2, \ldots, p_{n-1}, q_1, q_2, \ldots, q_m) \ge H(p_1, p_2, \ldots, p_{n-1}, p_n)        (3-36)

That is, the partitioning of events into subevents cannot decrease the 
entropy of the system. 

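Example 3-1

(a) Evaluate the average uncertainty associated with
the sample space of events shown in Fig. E3-1.

(b) Evaluate the average uncertainty pertaining to each
of the following probability schemes.

    [A, M = B \cup C]     [B|M, C|M]

Fig. E3-1

(c) Verify the rule of the additivity of the entropies.

Solution

    (a) H\left( \frac{1}{5}, \frac{4}{15}, \frac{8}{15} \right) = \frac{1}{15} (15 \log 5 + 12 \log 3 - 32) bits

    (b) H\left( \frac{1}{5}, \frac{4}{5} \right) = \frac{1}{15} (15 \log 5 - 24) bits

        H\left( \frac{1}{3}, \frac{2}{3} \right) = \frac{1}{15} (15 \log 3 - 10) bits

(c) It is a matter of numerical computation to verify that

    H\left( \frac{1}{5}, \frac{4}{15}, \frac{8}{15} \right) = H\left( \frac{1}{5}, \frac{4}{5} \right) + \frac{4}{5} H\left( \frac{1}{3}, \frac{2}{3} \right)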
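
The numerical verification called for in part (c) can be sketched as follows. The probabilities 1/5, 4/15, and 8/15 are an assumption of this illustration: they are the values consistent with the closed-form answer (15 log 5 + 12 log 3 - 32)/15, and which of B and C carries 4/15 does not affect the result.

```python
import math

def entropy(probs):
    """Entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Assumed event probabilities (see the note above).
p_A, p_B, p_C = 1 / 5, 4 / 15, 8 / 15
p_M = p_B + p_C                               # M = B or C

H_full = entropy([p_A, p_B, p_C])             # part (a)
H_coarse = entropy([p_A, p_M])                # scheme [A, M]
H_within = entropy([p_B / p_M, p_C / p_M])    # scheme [B|M, C|M]

closed_form = (15 * math.log2(5) + 12 * math.log2(3) - 32) / 15

print(H_full, closed_form)
print(H_coarse + p_M * H_within)
```

All three printed numbers agree to rounding error, which is the additivity rule of Eq. (3-19) with weighting factor p_M = 4/5.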

Example 3-2. Verify the rule of additivity of entropies for the following
probability schemes (Fig. E3-2a).

Fig. E3-2

(a) [A, B, C, D] (Fig. E3-2b).

(b) [A, A'] [B|A', C|A', D|A'] (Fig. E3-2c).

(c) (Fig. E3-2d.)

Numerical example:

    \begin{bmatrix} Event \\ Probability \end{bmatrix} = \begin{bmatrix} A & B & C & D \\ \frac{1}{2} & \frac{1}{4} & \frac{1}{8} & \frac{1}{8} \end{bmatrix}

Solution. The object of the problem is to demonstrate that the average uncertainty
in a system is not affected by the arrangement of the events, as long as the probabilities
of the individual events do not change.

    (a) H = -P_A \log P_A - P_B \log P_B - P_C \log P_C - P_D \log P_D

where P_A = 1/2, P_B = 1/4, P_C = 1/8, P_D = 1/8.

    H = -\frac{1}{2} \log \frac{1}{2} - \frac{1}{4} \log \frac{1}{4} - \frac{1}{8} \log \frac{1}{8} - \frac{1}{8} \log \frac{1}{8}
      = \frac{1}{2} \log 2 + \frac{1}{4} \log 4 + \frac{1}{8} \log 8 + \frac{1}{8} \log 8
      = 1\frac{3}{4} bits






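(b) According to the additivity property [Eq. (3-19)] of the H functions,

    H = [-P_A \log P_A - (1 - P_A) \log (1 - P_A)]
        - (1 - P_A) \left( \frac{P_B}{1 - P_A} \log \frac{P_B}{1 - P_A} + \frac{P_C}{1 - P_A} \log \frac{P_C}{1 - P_A} + \frac{P_D}{1 - P_A} \log \frac{P_D}{1 - P_A} \right)

      = -P_A \log P_A - (1 - P_A) \log (1 - P_A) - P_B \log \frac{P_B}{1 - P_A}
        - P_C \log \frac{P_C}{1 - P_A} - P_D \log \frac{P_D}{1 - P_A}

      = -P_A \log P_A - (P_B + P_C + P_D) \log (1 - P_A) - P_B \log P_B + P_B \log (1 - P_A)
        - P_C \log P_C + P_C \log (1 - P_A) - P_D \log P_D + P_D \log (1 - P_A)

      = -P_A \log P_A - P_B \log P_B - P_C \log P_C - P_D \log P_D

where P_A = 1/2, P_B = 1/4, P_C = 1/8, P_D = 1/8.

    H = -\frac{1}{2} \log \frac{1}{2} - \frac{1}{2} \log \frac{1}{2} - \frac{1}{2} \left( \frac{1}{2} \log \frac{1}{2} + \frac{1}{4} \log \frac{1}{4} + \frac{1}{4} \log \frac{1}{4} \right)
      = \frac{1}{2} \log 2 + \frac{1}{2} \log 2 + \frac{1}{2} \left( \frac{1}{2} \log 2 + \frac{1}{4} \log 4 + \frac{1}{4} \log 4 \right)
      = \frac{1}{2} + \frac{1}{2} + \frac{1}{2} \left( \frac{1}{2} + \frac{1}{2} + \frac{1}{2} \right)
      = 1\frac{3}{4} bits

    (c) H = -(P_A + P_B) \log (P_A + P_B) - (P_C + P_D) \log (P_C + P_D)
        + (P_A + P_B) \left( -\frac{P_A}{P_A + P_B} \log \frac{P_A}{P_A + P_B} - \frac{P_B}{P_A + P_B} \log \frac{P_B}{P_A + P_B} \right)
        + (P_C + P_D) \left( -\frac{P_C}{P_C + P_D} \log \frac{P_C}{P_C + P_D} - \frac{P_D}{P_C + P_D} \log \frac{P_D}{P_C + P_D} \right)

      = -(P_A + P_B) \log (P_A + P_B) - (P_C + P_D) \log (P_C + P_D)
        - P_A \log \frac{P_A}{P_A + P_B} - P_B \log \frac{P_B}{P_A + P_B}
        - P_C \log \frac{P_C}{P_C + P_D} - P_D \log \frac{P_D}{P_C + P_D}

      = -(P_A + P_B) \log (P_A + P_B) - (P_C + P_D) \log (P_C + P_D)
        - P_A \log P_A + P_A \log (P_A + P_B) - P_B \log P_B + P_B \log (P_A + P_B)
        - P_C \log P_C + P_C \log (P_C + P_D) - P_D \log P_D + P_D \log (P_C + P_D)

      = -P_A \log P_A - P_B \log P_B - P_C \log P_C - P_D \log P_D

where P_A = 1/2, P_B = 1/4, P_C = 1/8, P_D = 1/8.

    H = -\frac{3}{4} \log \frac{3}{4} - \frac{1}{4} \log \frac{1}{4}
        + \frac{3}{4} \left( -\frac{2}{3} \log \frac{2}{3} - \frac{1}{3} \log \frac{1}{3} \right)
        + \frac{1}{4} \left( -\frac{1}{2} \log \frac{1}{2} - \frac{1}{2} \log \frac{1}{2} \right)

      = -\frac{3}{4} \log 3 + \frac{3}{4} \log 4 + \frac{1}{4} \log 4
        + \frac{3}{4} \left( -\frac{2}{3} \log 2 + \frac{2}{3} \log 3 + \frac{1}{3} \log 3 \right)
        + \frac{1}{4} \left( \frac{1}{2} \log 2 + \frac{1}{2} \log 2 \right)

      = -\frac{3}{4} \log 3 + \frac{3}{2} + \frac{1}{2} + \frac{3}{4} \left( -\frac{2}{3} + \log 3 \right) + \frac{1}{4} \left( \frac{1}{2} + \frac{1}{2} \right)

      = -\frac{3}{4} \log 3 + 2 - \frac{1}{2} + \frac{3}{4} \log 3 + \frac{1}{4}

      = 1\frac{3}{4} bits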
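
The three arrangements can also be compared by machine. In the sketch below, `grouped_entropy` is a hypothetical helper that first measures the uncertainty about which group occurs and then adds the weighted uncertainty inside each group; the probabilities 1/2, 1/4, 1/8, 1/8 are those of the numerical example.

```python
import math

def entropy(probs):
    """Entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def grouped_entropy(groups):
    """Two-stage entropy: which group occurs, then which event within it."""
    totals = [sum(g) for g in groups]
    h = entropy(totals)
    for g, t in zip(groups, totals):
        h += t * entropy([p / t for p in g])   # weight by the group probability
    return h

P = {"A": 1 / 2, "B": 1 / 4, "C": 1 / 8, "D": 1 / 8}

h_a = entropy(P.values())                                      # [A, B, C, D]
h_b = grouped_entropy([[P["A"]], [P["B"], P["C"], P["D"]]])    # [A, A'] then inside A'
h_c = grouped_entropy([[P["A"], P["B"]], [P["C"], P["D"]]])    # pairs {A, B} and {C, D}

print(h_a, h_b, h_c)
```

Each arrangement gives 1.75 bits, so the average uncertainty does not depend on how the events are grouped.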

3-5. An Alternative Proof That the Entropy Function Possesses a
Maximum. The Shannon-Wiener theory of information is strongly
linked with the logarithmic function. Thus it is desirable to spend some
time investigating some of the basic mathematical properties of the
logarithmic function. Such mathematical presentations may seem distant
from an immediate engineering application; however, they are of
prime significance to those who are interested in basic research in the field.





First we shall prove a lemma on the convexity of the logarithmic
function. Then the lemma will be employed in giving an alternative
proof for property 3 of the previous section.

Lemma 1. The logarithmic function is a convex function. 

The reader will recall that a function of the real variable y = f(x) is
said to be convex upward in a real interval if for any x_1 and x_2 in that
interval one has

    \frac{1}{2} [f(x_1) + f(x_2)] \le f\left( \frac{x_1 + x_2}{2} \right)        (3-37)

Geometrically this relation can be simply interpreted by saying that the
chord connecting points 1 and 2 lies below the curve. An equivalent
definition can be given for a curve that is convex upward in an interval.
That is,

    a f(x_1) + (1 - a) f(x_2) \le f[a x_1 + (1 - a) x_2]     0 \le a \le 1        (3-38)

The geometrical interpretation of Eq. (3-38) is that in the interval under
consideration the chord lies everywhere below the curve (see Fig. 3-3a).




Fig. 3-3. (a) An upward convex function. (b) Logarithmic function is upward
convex.


A necessary and sufficient condition for y = f(x) to be convex upward on the
real axis is that

    \frac{d^2 y}{dx^2} \le 0        (3-40)

for every point of the real axis, provided that the second derivative exists.
This requirement is satisfied for the function

    y = \ln x        (3-41)

In fact,

    \frac{d^2 y}{dx^2} = -\frac{1}{x^2} < 0     for 0 < x < \infty        (3-42)

Note that this property is independent of the base of the logarithm as
long as the base is a number greater than unity:

    \ln x = \ln 2 \cdot \log_2 x        (3-43)

Thus we have shown that for positive values of x_1 and x_2

    \frac{1}{2} (\log x_1 + \log x_2) \le \log \frac{x_1 + x_2}{2}        (3-44)

    \sqrt{x_1 x_2} \le \frac{x_1 + x_2}{2}        (3-45)

The geometric mean of two positive numbers does not exceed their
average.*

An alternative formulation of Eq. (3-38) can be given by using the
following equivalent criterion for convex functions.† If f(x) is convex on
the real interval a ≤ x ≤ b, then for any three values of x,
a ≤ x_1 < x_2 < x_3 ≤ b,

    \begin{vmatrix} x_1 & f(x_1) & 1 \\ x_2 & f(x_2) & 1 \\ x_3 & f(x_3) & 1 \end{vmatrix} \le 0        (3-46)

Lemma 2. For any positive number x we have

    \ln x \le x - 1        (3-47)

This is a simple conclusion of the convexity of ln x. Evidently, the
tangent at point x = 1 is above the logarithmic curve (Fig. 3-3b). The
equation of the tangent to the curve at x = 1 is given by

    y = x - 1        (3-49)

so that

    \ln x \le x - 1        (3-50)

Again note that this property is equally true for the logarithmic function
of the base 2, i.e.,

    \log x = \ln x \log e \le (x - 1) \log e
* This statement can be extended to the case of n positive numbers; that is,

    (x_1 x_2 \cdots x_n)^{1/n} \le \frac{x_1 + x_2 + \cdots + x_n}{n}

† A discussion on convex functions is generally included in books on advanced
calculus. Those interested in further reading on the subject of convexity may refer
to Hardy, Littlewood, and Polya. See also G. W. Medlin, On Limits of the Real
Characteristic Roots of Matrices with Real Elements, Proc. Am. Math. Soc., vol. 7,
pp. 912-917, or G. Julia, “Les Principes géométriques d’analyse,” Gauthier-Villars,
Paris.
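
A quick numerical spot check of Lemma 2 and of the relation between the geometric and arithmetic means is sketched below. The sample points are arbitrary choices made for this illustration, and such a check is of course no substitute for the proof.

```python
import math

xs = [0.01, 0.5, 1.0, 2.0, 10.0, 1000.0]   # arbitrary positive sample points

# Lemma 2: ln x <= x - 1, so every gap (x - 1) - ln x is nonnegative.
gaps = [(x - 1) - math.log(x) for x in xs]

# Geometric mean versus arithmetic mean of the same numbers.
geometric = math.exp(sum(math.log(x) for x in xs) / len(xs))
arithmetic = sum(xs) / len(xs)

print(min(gaps), gaps[2])
print(geometric, arithmetic)
```

The gap vanishes only at x = 1, where the tangent touches the logarithmic curve, and the geometric mean stays below the arithmetic mean.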





The above lemma will be of some use in our future work. At present,
we may employ it to give an alternative proof for the fact that the average
uncertainty is greatest when all the events are equiprobable. In order
to show this, assume that the space of x contains m points, not necessarily
with equal probabilities. It is required to show that H(x) does not exceed
the entropy of the equiprobable case, that is,

    H(x) \le -m \left( \frac{1}{m} \log \frac{1}{m} \right)        (3-51)

or to prove

    H(x) \le \log m        (3-52)

But by definition,

    H(x) - \log m = \sum_{i=1}^{m} p_i \log \frac{1}{p_i} + \log \frac{1}{m}        (3-53)

Since we are dealing with exhaustive systems, log (1/m) can be replaced by

    \sum_{i=1}^{m} p_i \log \frac{1}{m}        (3-54)

or

    H(x) - \log m = \sum_{i=1}^{m} p_i \log \frac{1}{p_i} + \sum_{i=1}^{m} p_i \log \frac{1}{m}        (3-55)

    H(x) - \log m = \sum_{i=1}^{m} p_i \log \frac{1}{m p_i}        (3-56)

Applying Lemma 2, we find

    H(x) - \log m = \sum_{i=1}^{m} p_i \log \frac{1}{m p_i} \le \sum_{i=1}^{m} p_i \left( \frac{1}{m p_i} - 1 \right) \log e        (3-57)

    H(x) - \log m \le \log e \left[ \sum_{i=1}^{m} \left( \frac{1}{m} - p_i \right) \right] = 0        (3-58)

    H(x) \le \log m        (3-59)

The maximum entropy corresponds to the case when all m states have
equal probabilities of occurrence p_i = 1/m.
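
The chain of inequalities can be exercised on randomly generated distributions. This is a sketch under stated assumptions: the seed, the number of trials, and m = 5 are arbitrary, and the random weights are drawn from (0, 1] so that no probability is exactly zero.

```python
import math
import random

def entropy(probs):
    """Entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

random.seed(0)            # fixed seed so the run is repeatable
m = 5
log2_e = math.log2(math.e)
max_diff = -math.inf      # largest observed H(x) - log m
max_bound = 0.0           # largest |right-hand side of (3-57)| observed

for _ in range(1000):
    weights = [1.0 - random.random() for _ in range(m)]   # values in (0, 1]
    total = sum(weights)
    p = [w / total for w in weights]
    # Right-hand side of (3-57): sum of p_i (1/(m p_i) - 1), times log e.
    bound = sum(pi * (1 / (m * pi) - 1) for pi in p) * log2_e
    max_diff = max(max_diff, entropy(p) - math.log2(m))
    max_bound = max(max_bound, abs(bound))

print(max_diff, max_bound)
```

The bound collapses to zero (up to rounding) for every distribution, as Eq. (3-58) states, and H(x) - log m never rises above it.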

3-6. Sources and Binary Sources. In the study of probability one 
usually employs concepts of sets but uses certain terminology which 
differs from that of set theory. Examples of such terminology were 
given in Sec. 2-6. Similarly, information theory uses certain specialized 
terms which need to be translated into a more universally understood 





mathematical form. For our immediate use the following terms are 
defined : 

A source or transmitter is similar to the space of a random experiment. 
That is, a source is the assemblage of all possible events associated with 
the sample space of a complete random experiment. Each outcome of the 
experiment corresponds to an elementary output of the source and is 
called a symbol or a character or a letter. 

The finite alphabet of a communication source consists of all its finite 
distinct characters, much in the same way that the sample space consists 
of all possible elementary outcomes of a discrete random experiment. 

Fig. 3-4. A symbolic illustration of the message space of an independent source; words
are specified as sequences of letters (with or without repetition). In the figure, each
element is a letter, a symbol, or a character, and a word is a specified sequence of letters.

A finite sequence of characters may be called a word or a message in the 
same way that the sequence of a number of outcomes associated with the 
repetition of an experiment may be designated as an event. This is 
schematically illustrated in Fig. 3-4. When the probabilities of the 
selection of successive letters are independent, we say that the source has 
no memory. This chapter is devoted to the study of discrete schemes 
without memory. The study of sources with memory will be deferred 
until Chap. 11. 

A binary source is associated with the sample space of a random binary 
experiment when the experiment is repeated over and over. In lieu of 
saying that a random experiment has only two possible exclusive outcomes
A and B, we adhere to communication terminology and say that a
binary source has an alphabet of two letters A and B. The following
three matrices summarize the information-theory performance of a binary
source:

    Alphabet = {letters} = [A, B]
    Probability matrix [P] = [p, 1 - p] = [p, q]
    Self-information matrix [I] = [-\log p, -\log (1 - p)]        (3-60)
    Average information per letter H = \overline{I} = -p \log p - (1 - p) \log (1 - p)






The communication entropy for such a system will be 

    H(p) = -p \log p - q \log q = -p \log p - (1 - p) \log (1 - p)        (3-61)

A plot of the function H(p) in terms of p is shown in Fig. 3-5. The
maximum of this function, as anticipated, occurs at p = 1/2, for which
the entropy becomes 1 bit per letter. If a transmitter is sending the two
letters A and B with equal probabilities, the average information
per letter is a maximum of 1 bit per letter.
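
The shape of the curve can be explored with a small sketch of Eq. (3-61). The function name `binary_entropy`, the grid of 101 points, and the convention H(0) = H(1) = 0 are assumptions of this illustration.

```python
import math

def binary_entropy(p):
    """H(p) of Eq. (3-61) in bits, with the convention H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

grid = [k / 100 for k in range(101)]       # p = 0.00, 0.01, ..., 1.00
p_star = max(grid, key=binary_entropy)     # grid point with the largest entropy

print(p_star, binary_entropy(p_star))
print(binary_entropy(0.1), binary_entropy(0.9))
```

The largest value on the grid is 1 bit at p = 0.5, and the last line shows the symmetry of the curve about that point.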

An interesting observation can be made here about the entropy of a
binary source. That is, H(p) of Eq. (3-61) is a function concave
downward (or convex upward).

    \frac{1}{2} [H(p_1) + H(p_2)] \le H\left( \frac{p_1 + p_2}{2} \right)        (3-62)

Fig. 3-5. Entropy of an independent binary source.

Suppose that we have three specific binary sources for communication
between two stations. If we assume the pertinent probabilities for the
first letters of each source to be p_1, p_2, and (p_1 + p_2)/2, the above
statement tells us that the average uncertainty of the third source is larger
than the mean of the other two. Loosely speaking, it is relatively more
difficult to predict the transmission of the letters of the third source.

For example, consider the following two binary sources s_1 and s_2.

    p_{A1} = \frac{1}{3}     p_{A2} = \frac{1}{4}
    p_{B1} = \frac{2}{3}     p_{B2} = \frac{3}{4}

    H(s_1) = -\frac{1}{3} \log \frac{1}{3} - \frac{2}{3} \log \frac{2}{3} = -\frac{2}{3} + \log 3

    H(s_2) = -\frac{1}{4} \log \frac{1}{4} - \frac{3}{4} \log \frac{3}{4} = 2 - \frac{3}{4} \log 3

A third binary source with an average probability (p_{A1} + p_{A2})/2 and
(p_{B1} + p_{B2})/2 per letter will have an average entropy per letter of

    p_A = \frac{1}{2} \left( \frac{1}{3} + \frac{1}{4} \right) = \frac{7}{24}     p_B = \frac{1}{2} \left( \frac{2}{3} + \frac{3}{4} \right) = \frac{17}{24}

    H(s) = -\frac{7}{24} \log \frac{7}{24} - \frac{17}{24} \log \frac{17}{24} = 3 + \log 3 - \frac{7}{24} \log 7 - \frac{17}{24} \log 17

The average information per letter for the third source is greater than 
the mean of the average information associated with letters of the first 
and the second source. 
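
The comparison can be made concrete with numbers. The sketch below assumes first-letter probabilities of 1/3 and 1/4 for the two given sources, which are the values consistent with the averaged probabilities 7/24 and 17/24; the variable names are illustrative only.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary source with letter probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p1, p2 = 1 / 3, 1 / 4        # assumed first-letter probabilities of s_1 and s_2
p3 = (p1 + p2) / 2           # third source: 7/24

mean_of_entropies = (binary_entropy(p1) + binary_entropy(p2)) / 2
entropy_of_mean = binary_entropy(p3)

print(binary_entropy(p1), binary_entropy(p2))
print(mean_of_entropies, entropy_of_mean)
```

The third source comes out at roughly 0.871 bit per letter against a mean of roughly 0.865 bit for the other two, so the inequality of Eq. (3-62) holds with a small margin here.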

3-7. Measure of Information for Two-dimensional Discrete Finite
Probability Schemes. In this section, we extend the definition of the
measure of information from a one-dimensional to a two-dimensional
probability scheme. The content of this section forms an important






part of the basic concepts of information theory for several reasons. In
the first place, the appropriate generalization from one-dimensional to
two-dimensional can be considered as an induction rule for the derivation
of the information measure of any finite-dimensional probability
space. In the second place, the two-dimensional probability scheme
provides the simplest mathematical model for an engineering communication
system, that is, a system with a “transmitter” and a “receiver” or a
transducer with in and out ports. Finally the concept of mutual information
or transinformation which forms one of the fundamental concepts of
information theory can be discussed in the light of this product space.

Fig. 3-6. (a) A sample space E. (b) A sample space F.

Consider two finite discrete sample spaces Ω_1, Ω_2, and their product
space as illustrated in Figs. 3-6 and 3-7. In Ω_1 and Ω_2 we select complete
sets of events in the sense of Eqs. (3-1) and (3-2).

    [E] = [E_1, E_2, \ldots, E_n]
    [F] = [F_1, F_2, \ldots, F_m]        (3-63)

Fig. 3-7. Product space of E ⊗ F.

Each event E_k of Ω_1 may occur in conjunction with any event F_j of Ω_2;
thus the following events form a complete set of events in the product
space Ω_1Ω_2.

    [EF] = \begin{bmatrix}
        E_1F_1 & E_1F_2 & \cdots & E_1F_m \\
        E_2F_1 & E_2F_2 & \cdots & E_2F_m \\
        \vdots & \vdots &        & \vdots \\
        E_nF_1 & E_nF_2 & \cdots & E_nF_m
    \end{bmatrix}        (3-64)


where $E_kF_j$ stands for the simultaneous occurrence of the events $E_k$ and
$F_j$. In this fashion, we are confronted with the following three complete
sets of probability schemes:

$$P\{E\} = [P\{E_k\}] \tag{3-65}$$

$$P\{F\} = [P\{F_j\}] \tag{3-66}$$

$$P\{EF\} = [P\{E_kF_j\}] \tag{3-67}$$









No stipulation is made about the independence or dependence of the
events $E_k$ and $F_j$. Of course, each one of the above three schemes is, by
assumption, a finite complete probability scheme. The data pertaining
to this fact can be conveniently obtained from the joint probability
matrix below.

$$[P\{X,Y\}] = \begin{bmatrix}
p\{1,1\} & p\{1,2\} & \cdots & p\{1,m\} \\
p\{2,1\} & p\{2,2\} & \cdots & p\{2,m\} \\
\vdots & \vdots & & \vdots \\
p\{n,1\} & p\{n,2\} & \cdots & p\{n,m\}
\end{bmatrix} \tag{3-68}$$


X and Y are random variables, associated with spaces $\Omega_1$ and $\Omega_2$,
respectively, and (X,Y) with the product space. The marginal probabilities of
the two-dimensional random variable (X,Y) yield the probabilities pertaining
to each of the random variables X and Y. For example,

$$p\{x_1\} = P\{E_1\} = P\{E_1F_1 \cup E_1F_2 \cup \cdots \cup E_1F_m\} = p\{1,1\} + p\{1,2\} + \cdots + p\{1,m\} \tag{3-69}$$

$$p\{y_2\} = P\{F_2\} = P\{E_1F_2 \cup E_2F_2 \cup \cdots \cup E_nF_2\} = p\{1,2\} + p\{2,2\} + \cdots + p\{n,2\} \tag{3-70}$$

$$p\{x_k\} = \sum_{j=1}^{m} p\{x_k,y_j\} \tag{3-71}$$

$$p\{y_j\} = \sum_{k=1}^{n} p\{x_k,y_j\} \tag{3-72}$$


Thus we have three finite complete probability schemes, and naturally
there are three corresponding entropies:

$$\begin{aligned}
H(X,Y) = {} & -p\{1,1\}\log p\{1,1\} - p\{1,2\}\log p\{1,2\} - \cdots - p\{1,m\}\log p\{1,m\} \\
& -p\{2,1\}\log p\{2,1\} - p\{2,2\}\log p\{2,2\} - \cdots - p\{2,m\}\log p\{2,m\} \\
& \;\;\vdots \\
& -p\{n,1\}\log p\{n,1\} - p\{n,2\}\log p\{n,2\} - \cdots - p\{n,m\}\log p\{n,m\}
\end{aligned} \tag{3-73}$$

$$\begin{aligned}
H(X) = {} & -(p\{1,1\} + p\{1,2\} + \cdots + p\{1,m\})\log(p\{1,1\} + p\{1,2\} + \cdots + p\{1,m\}) \\
& -(p\{2,1\} + p\{2,2\} + \cdots + p\{2,m\})\log(p\{2,1\} + p\{2,2\} + \cdots + p\{2,m\}) \\
& \;\;\vdots \\
& -(p\{n,1\} + p\{n,2\} + \cdots + p\{n,m\})\log(p\{n,1\} + p\{n,2\} + \cdots + p\{n,m\})
\end{aligned} \tag{3-74}$$

$$\begin{aligned}
H(Y) = {} & -(p\{1,1\} + p\{2,1\} + \cdots + p\{n,1\})\log(p\{1,1\} + p\{2,1\} + \cdots + p\{n,1\}) \\
& -(p\{1,2\} + p\{2,2\} + \cdots + p\{n,2\})\log(p\{1,2\} + p\{2,2\} + \cdots + p\{n,2\}) \\
& \;\;\vdots \\
& -(p\{1,m\} + p\{2,m\} + \cdots + p\{n,m\})\log(p\{1,m\} + p\{2,m\} + \cdots + p\{n,m\})
\end{aligned} \tag{3-75}$$


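The above three entropies can be expressed in a more condensed fashion
directly from the two-dimensional joint probability matrix of Eq. (3-68):

$$H(X,Y) = -\sum_{k=1}^{n}\sum_{j=1}^{m} p\{k,j\}\log p\{k,j\} \tag{3-76}$$

$$H(X) = -\sum_{k=1}^{n}\left(\sum_{j=1}^{m} p\{k,j\}\right)\log\left(\sum_{j=1}^{m} p\{k,j\}\right) \tag{3-77}$$

$$H(Y) = -\sum_{j=1}^{m}\left(\sum_{k=1}^{n} p\{k,j\}\right)\log\left(\sum_{k=1}^{n} p\{k,j\}\right) \tag{3-78}$$

where H(X,Y) is the joint entropy, H(X) the marginal entropy of X,
and H(Y) the marginal entropy of Y.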

The marginal entropies can, of course, be directly expressed in terms of
marginal probabilities $p\{x_k\}$ and $p\{y_j\}$, that is,

$$H(X) = -\sum_{k=1}^{n} p\{x_k\}\log p\{x_k\} \tag{3-79}$$

$$H(Y) = -\sum_{j=1}^{m} p\{y_j\}\log p\{y_j\} \tag{3-80}$$


The next section deals with conditional entropies associated with a 
discrete two-dimensional probability scheme. 
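As an illustration of Eqs. (3-76) to (3-80), the following Python sketch computes the joint and marginal entropies from a joint probability matrix. The function name and the small 2 x 2 matrix are assumptions chosen for the example; logarithms are taken to base 2, and zero entries are skipped on the usual convention that 0 log 0 = 0.

```python
import math

def joint_and_marginal_entropies(joint):
    """Return (H(X,Y), H(X), H(Y)) in bits for joint[k][j] = p{x_k, y_j}."""
    def h(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    px = [sum(row) for row in joint]          # marginal of X, Eq. (3-71)
    py = [sum(col) for col in zip(*joint)]    # marginal of Y, Eq. (3-72)
    hxy = h([p for row in joint for p in row])
    return hxy, h(px), h(py)

# Illustrative joint matrix (rows: x_k, columns: y_j); entries sum to 1.
joint = [[0.50, 0.25],
         [0.00, 0.25]]

hxy, hx, hy = joint_and_marginal_entropies(joint)
print(f"H(X,Y) = {hxy:.4f}  H(X) = {hx:.4f}  H(Y) = {hy:.4f}")
```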

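3-8. Conditional Entropies. Reference is made to the matrix of Eq.
(3-68) and Fig. 3-7; an event $F_j$, for example, may occur in conjunction
with $E_1$, $E_2$, . . . , or $E_n$.

$$F_j = \bigcup_{k=1}^{n} E_kF_j \tag{3-81}$$

$$P\{X = x_k \mid Y = y_j\} = \frac{P\{X = x_k \cap Y = y_j\}}{P\{Y = y_j\}} \tag{3-82}$$

or

$$p\{x_k|y_j\} = \frac{p\{x_k,y_j\}}{p\{y_j\}} \tag{3-83}$$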



Now consider the following probability scheme:

$$\{E|F_j\} = [E_1|F_j,\; E_2|F_j,\; \ldots,\; E_n|F_j] \tag{3-84}$$

$$P\{E|F_j\} = [p\{x_1|y_j\},\; p\{x_2|y_j\},\; \ldots,\; p\{x_n|y_j\}] \tag{3-85}$$

The sum of the elements of this matrix is unity; that is, the probability
scheme thus described is not only finite but also complete. Therefore
an entropy may be directly associated with such a situation.

$$H(X|y_j) = -\sum_{k=1}^{n} \frac{p\{x_k,y_j\}}{p\{y_j\}}\log\frac{p\{x_k,y_j\}}{p\{y_j\}} = -\sum_{k=1}^{n} p\{x_k|y_j\}\log p\{x_k|y_j\} \tag{3-86}$$


Now one may take the average of this conditional entropy for all admissible
values of $y_j$, in order to obtain a measure of average conditional
entropy of the system.

$$H(X|Y) = \overline{H(X|y_j)} = \sum_{j=1}^{m} p\{y_j\}\,H(X|y_j) = -\sum_{j=1}^{m} p\{y_j\}\sum_{k=1}^{n} p\{x_k|y_j\}\log p\{x_k|y_j\} \tag{3-87}$$

$$H(X|Y) = -\sum_{j=1}^{m}\sum_{k=1}^{n} p\{y_j\}\,p\{x_k|y_j\}\log p\{x_k|y_j\} \tag{3-88}$$

Similarly, one can evaluate the average conditional entropy H(Y|X):

$$H(Y|X) = -\sum_{k=1}^{n}\sum_{j=1}^{m} p\{x_k\}\,p\{y_j|x_k\}\log p\{y_j|x_k\} \tag{3-89}$$

The two conditional entropies (the word "average" will be omitted for
briefness) can be written as

$$H(X|Y) = -\sum_{j=1}^{m}\sum_{k=1}^{n} p\{x_k,y_j\}\log p\{x_k|y_j\} \tag{3-90}$$

$$H(Y|X) = -\sum_{k=1}^{n}\sum_{j=1}^{m} p\{x_k,y_j\}\log p\{y_j|x_k\} \tag{3-91}$$

The conditional entropies along with marginals and the joint entropy
compose the five principal entropies pertaining to a joint distribution.
All logarithms are taken to the base 2 in order to obtain units in binary
digits. Note that all entropies are nonnegative numbers, as they
are sums of nonnegative terms.

The physical interpretation of the different entropies will be discussed 
in the subsequent section. 
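The two conditional entropies of Eqs. (3-90) and (3-91) can be evaluated directly from a joint probability matrix. The sketch below is a minimal Python illustration; the function name and the sample matrix are assumptions made for the example, not part of the text.

```python
import math

def conditional_entropies(joint):
    """Return (H(X|Y), H(Y|X)) in bits for joint[k][j] = p{x_k, y_j}."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    h_x_given_y = 0.0
    h_y_given_x = 0.0
    for k, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # zero entries contribute nothing
                h_x_given_y -= p * math.log2(p / py[j])  # Eq. (3-90)
                h_y_given_x -= p * math.log2(p / px[k])  # Eq. (3-91)
    return h_x_given_y, h_y_given_x

# Illustrative joint matrix (rows: x_k, columns: y_j).
joint = [[0.50, 0.25],
         [0.00, 0.25]]

h_x_given_y, h_y_given_x = conditional_entropies(joint)
print(f"H(X|Y) = {h_x_given_y:.4f}  H(Y|X) = {h_y_given_x:.4f}")
```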

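Example 3-3. Determine five entropies pertaining to the joint probability matrix
of Example 2-30.

Solution

$$H(X,Y) = -\sum_{1}^{6}\sum_{1}^{6} \tfrac{1}{36}\log\tfrac{1}{36} = -\log\tfrac{1}{36} = 2(1 + \log 3)$$

$$H(X) = H(Y) = -\sum_{1}^{6} \tfrac{1}{6}\log\tfrac{1}{6} = -\log\tfrac{1}{6} = 1 + \log 3$$

$$H(X|Y) = H(Y|X) = -\sum_{1}^{6}\sum_{1}^{6} \tfrac{1}{36}\log\tfrac{1}{6} = 1 + \log 3$$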

3-9. A Sketch of a Communication Network. In this section, we wish 
to present an informal sketch of a model for a communication network. 
In contrast to the material of the previous sections, the content of this 
section is not presented in a strict mathematical frame. The words 
source, load, channel, transducer, transmitter, and receiver are used in 
their common engineering sense. Later on, we shall assign a strict 
mathematical description to some of these words, but for the present the 
reader is cautioned against any identification of these terms with similar 
terms defined in the professional literature. 

In the study of physical systems from a systems engineering point of 
view, we generally focus our attention on a number of points of entry to 
the system. For example, in ordinary electric networks, we may be 
interested in the study of voltage-current relationships at the same port 
of entry in the network (Fig. 3-8a). This is generally known as a one-port 
system. 

When the voltage-current relationships between two ports of entry are
of interest, the situation is that of a two-port system. In a two-port
system, a physical driving force is applied to one port and its effect
observed at a second port. The second port may be connected to a
"receiver" or "load" (Fig. 3-8b). Such a system is usually known as a
two-port, or a loaded transducer. More generally, in many physical
problems we may be interested in the study of an n-port network (Fig.
3-8c). From linear network theory, we know that a complete study of
systems requires a knowledge of transmission functions between
different ports. For example, if we concentrate on different impedances
of a network, the following matrices are considered for a general study of
a one-port, two-port, and n-port, respectively.



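$$[Z_{11}] \qquad
\begin{bmatrix} Z_{11} & Z_{12} \\ Z_{21} & Z_{22} \end{bmatrix} \qquad
\begin{bmatrix} Z_{11} & \cdots & Z_{1n} \\ \vdots & & \vdots \\ Z_{n1} & \cdots & Z_{nn} \end{bmatrix} \tag{3-92}$$

(The impedances are used in the ordinary circuit sense, $Z_{kj}$ being the
transfer impedance between the kth and the jth port.)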

Fig. 3-8. (a) A one-port network. (b) A two-port analog of a channel connecting a
source and a receiver. (c) An n-port analog of a communication system consisting of
several sources, channels, and sinks.

An equivalent interpretation can be made for the study of probabilistic
systems. In fact, the systems point of view does not rely on the deterministic
or probabilistic description of the performance. It is based on
the ports of application of stimuli and observation of responses. For
instance, consider a source of communication with a given alphabet.
The source is linked to the receiver via a channel. The system may be
described by a joint probability matrix, that is, by giving the probability
of the joint occurrence of two symbols, one at the input and the other at
the output. The joint probability matrix may be designated by

$$[P\{X,Y\}] = \begin{bmatrix}
P\{x_1,y_1\} & P\{x_1,y_2\} & \cdots & P\{x_1,y_n\} \\
P\{x_2,y_1\} & P\{x_2,y_2\} & \cdots & P\{x_2,y_n\} \\
\vdots & \vdots & & \vdots \\
P\{x_m,y_1\} & P\{x_m,y_2\} & \cdots & P\{x_m,y_n\}
\end{bmatrix}$$


But in a product space of the two random variables X and Y there are
five basic probability schemes of interest. These are

    [P{X,Y}]    joint probability matrix
    [P{X}]      marginal probability matrix of X
    [P{Y}]      marginal probability matrix of Y
    [P{X|Y}]    conditional probability matrix
    [P{Y|X}]    conditional probability matrix

Thus we are naturally led to five distinct functions in the study of a simple
communication model.

This idea can be generalized to n-port communication systems. The 
problem is similar to the study of an n-dimensional discrete random varia- 
ble or product space. In each product probability space there are a 
finite number of basic probability schemes (marginals and conditionals of 
different orders). With each of these schemes, we may associate an 
entropy and directly interpret its physical significance. 

A source of information is in a way similar to the driving source in a 
circuit; the receiver is similar to the load, and the channel acts as the 
network connecting the load to the source. The following interpretations 
of the different entropies for a two-port communication system seem 
pertinent. 

H(X)    Average information per character at the source, or the
        entropy of the source.

H(Y)    Average information per character at the destination, or the
        entropy at the receiver.                                    (3-94)

H(X,Y)  Average information per pairs of transmitted and received
        characters, or the average uncertainty of the communication
        system as a whole.                                          (3-95)

H(Y|X)  A specific character $x_k$ being transmitted; one of the
        permissible $y_j$ may be received with a given probability. The
        entropy associated with this probability scheme when $x_k$
        covers sets of all transmitted symbols, that is,
        $\overline{H(Y|x_k)}$, is the conditional entropy H(Y|X), a
        measure of information about the receiving port, where it is
        known that X is transmitted.                                (3-96)

H(X|Y)  A specific character $y_j$ being received; this may be a result
        of transmission of one of the $x_k$ with a given probability.
        The entropy associated with this probability scheme when $y_j$
        covers all the received symbols, that is, $\overline{H(X|y_j)}$,
        is the entropy H(X|Y) or equivocation, a measure of information
        about the source, where it is known that Y is received.     (3-97)

H(X) and H(Y) give indications of the probabilistic nature of the
transmission and reception ports, respectively. H(Y|X) gives an indication
of the noise or error in the channel, and H(X|Y) indicates a measure
of equivocation, that is, how well one can recover the input content from
the output.

All the probabilities encountered in the two-dimensional case can be 
derived from the joint probability matrix. Thus, a joint probability 
matrix specifies a communication channel, in much the same way that an 
impedance or admittance matrix specifies the performance of an ordinary 
linear two-port network with respect to its ports. 

3-10. Derivation of the Noise Characteristics of a Channel. In communication
problems in general, the joint probability matrix is not given.
It is customary to specify the noise characteristics of a channel and the
source alphabet probabilities. From these data we can directly derive
the joint and the output probability matrices. For example, the joint
probability matrix is

$$[P\{X,Y\}] = \begin{bmatrix}
P\{x_1\}P\{y_1|x_1\} & P\{x_1\}P\{y_2|x_1\} & \cdots & P\{x_1\}P\{y_n|x_1\} \\
P\{x_2\}P\{y_1|x_2\} & P\{x_2\}P\{y_2|x_2\} & \cdots & P\{x_2\}P\{y_n|x_2\} \\
\vdots & \vdots & & \vdots \\
P\{x_m\}P\{y_1|x_m\} & P\{x_m\}P\{y_2|x_m\} & \cdots & P\{x_m\}P\{y_n|x_m\}
\end{bmatrix}$$

which can be written as

$$[P\{X\}][P\{Y|X\}] = [P\{X,Y\}]$$

(In this form we assume that the marginal probability matrix is written
in a diagonal form.)

Similarly, if for convenience $[P\{X\}]$ is written in the form of a row
matrix, we have

$$[P\{X\}][P\{Y|X\}] = [P\{Y\}]$$

where $[P\{Y\}]$ will also be a row matrix designating the probabilities of
the output alphabets.
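The two matrix products above can be sketched in a few lines of Python. The source probabilities and the channel (noise) matrix used here are illustrative assumptions, as is the function name; `channel[i][j]` stands for $P\{y_j|x_i\}$.

```python
def joint_and_output(px, channel):
    """Derive the joint matrix and the output probabilities.

    px[i]         : source probability P{x_i}
    channel[i][j] : noise characteristic P{y_j | x_i} (each row sums to 1)
    """
    # Diagonal [P{X}] times [P{Y|X}]: scale row i of the channel by P{x_i}.
    joint = [[px[i] * pji for pji in channel[i]] for i in range(len(px))]
    # Row-matrix [P{X}] times [P{Y|X}]: column sums of the joint matrix.
    py = [sum(col) for col in zip(*joint)]
    return joint, py

# Illustrative binary source and channel (assumed values).
px = [0.6, 0.4]
channel = [[0.9, 0.1],
           [0.2, 0.8]]

joint, py = joint_and_output(px, channel)
print("joint matrix:", joint)
print("output probabilities:", py)
```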

This section offers for discussion two particularly simple communica- 
tion channels: 

1. Discrete noise-free channel 

2. Discrete channel with independent input-output 

Discrete Noise-free Channel. In such channels, as their name indicates,
every letter of the input alphabet is in a one-to-one correspondence with
a letter of the output alphabet. The joint probability matrix, as well as
the channel probability matrix, is of the diagonal form:

$$[P\{X,Y\}] = \begin{bmatrix}
p\{x_1,y_1\} & 0 & \cdots & 0 \\
0 & p\{x_2,y_2\} & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & p\{x_n,y_n\}
\end{bmatrix} \tag{3-98}$$

$$[P\{X|Y\}] = [P\{Y|X\}] = \begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix} \tag{3-99}$$

For a noise-free channel the entropies are

$$H(X,Y) = H(X) = H(Y) = -\sum_{i=1}^{n} p\{x_i,y_i\}\log p\{x_i,y_i\} \tag{3-100}$$

$$H(Y|X) = H(X|Y) = 0 \tag{3-101}$$

The interpretation of these formulas for a communication system is
rather clear. To each transmitted symbol in a noise-free channel there
corresponds one, and only one, received symbol. The average uncertainty
at the receiving end is exactly the same as at the sending end.
The individual conditional entropies are all equal to zero, a fact that
reiterates a nonambiguous or noise-free transmission.

Discrete Channel with Independent Input-Output. In a similar fashion,
one can visualize a channel in which there is no correlation between input
and output symbols. That is, an input letter $x_i$ can be received as any
one of the symbols $y_j$ of the receiving alphabet with equal probability.
As will be shown, such a system is a degenerate one as it does not transmit
any information. The joint probability matrix has n identical columns.

$$[P\{X,Y\}] = \begin{bmatrix}
p_1 & p_1 & \cdots & p_1 \\
p_2 & p_2 & \cdots & p_2 \\
\vdots & \vdots & & \vdots \\
p_m & p_m & \cdots & p_m
\end{bmatrix} \tag{3-102}$$

The input and output symbol probabilities are statistically independent
of each other, that is,

$$p\{x_i,y_j\} = p_1\{x_i\}\,p_2\{y_j\} \tag{3-103}$$

This can be shown directly by calculation. Since the elements of the joint
matrix sum to unity, $\sum_{i=1}^{m} p_i = 1/n$, and

$$p_1\{x_i\} = n p_i \qquad p_2\{y_j\} = \sum_{i=1}^{m} p_i = \frac{1}{n} \qquad p_1\{x_i\}\,p_2\{y_j\} = p_i = p\{x_i,y_j\} \tag{3-104}$$

From this one concludes that

$$p\{x_i|y_j\} = p_1\{x_i\} = n p_i \tag{3-105}$$

$$p\{y_j|x_i\} = p_2\{y_j\} = \frac{1}{n} \tag{3-106}$$

The different entropies can be computed directly:

$$H(X,Y) = -n\left(\sum_{i=1}^{m} p_i\log p_i\right) \tag{3-107}$$

$$H(X) = -\sum_{i=1}^{m} n p_i\log n p_i = -n\left(\sum_{i=1}^{m} p_i\log p_i\right) - \log n \tag{3-108}$$

$$H(Y) = -n\left(\frac{1}{n}\log\frac{1}{n}\right) = \log n \tag{3-109}$$

$$H(X|Y) = -\sum_{i=1}^{m} n p_i\log n p_i = H(X) \tag{3-110}$$

$$H(Y|X) = -\sum_{i=1}^{m} n p_i\log\frac{1}{n} = \log n = H(Y) \tag{3-111}$$


The interpretation of the above formulas is that a channel with independent
input and output ports conveys no information whatsoever.
To mention a network analogy, this channel seems to have the largest
internal "loss," like a resistive network, in contrast to the noise-free
channel which resembles a "lossless" network.
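The two limiting cases can be compared numerically. The Python sketch below evaluates the five entropies for an assumed noise-free (diagonal) joint matrix and for an assumed joint matrix with identical columns; the matrices and names are illustrative only.

```python
import math

def five_entropies(joint):
    """Five entropies (bits) of joint[i][j] = p{x_i, y_j}, from the defining sums."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    cells = [(i, j, p) for i, row in enumerate(joint)
             for j, p in enumerate(row) if p > 0]
    return {
        "H(X,Y)": -sum(p * math.log2(p) for _, _, p in cells),
        "H(X)": -sum(p * math.log2(p) for p in px if p > 0),
        "H(Y)": -sum(p * math.log2(p) for p in py if p > 0),
        "H(X|Y)": -sum(p * math.log2(p / py[j]) for _, j, p in cells),
        "H(Y|X)": -sum(p * math.log2(p / px[i]) for i, _, p in cells),
    }

# Noise-free channel: diagonal joint matrix (assumed probabilities).
noise_free = five_entropies([[0.5, 0.0, 0.0],
                             [0.0, 0.3, 0.0],
                             [0.0, 0.0, 0.2]])

# Independent input-output: n = 2 identical columns, p_1 + p_2 = 1/n.
independent = five_entropies([[0.3, 0.3],
                              [0.2, 0.2]])

for name, result in (("noise-free", noise_free), ("independent", independent)):
    print(name, {key: round(value, 4) for key, value in result.items()})
```

In the first case both conditional entropies vanish; in the second they equal the corresponding marginal entropies.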

3-11. Some Basic Relationships among Different Entropies. In this 
section we should like first to investigate some of the fundamental 
mathematical relations that exist among different entropies in a simple 
two-port communication system and then point out their significance in 
communication theories. Our starting point is the evident fact that the 
different probabilities in a two-dimensional distribution (product space) 
are interrelated, plus the fact that the chosen logarithmic weighting 
function is a convex function on the positive real axis. We begin with 
the basic relationship that exists among the joint, marginal, and con- 
ditional probabilities, that is, 

$$p\{x_k,y_j\} = p\{x_k|y_j\}\cdot p\{y_j\} = p\{y_j|x_k\}\cdot p\{x_k\} \tag{3-112}$$

$$\log p\{x_k,y_j\} = \log p\{x_k|y_j\} + \log p\{y_j\} = \log p\{y_j|x_k\} + \log p\{x_k\} \tag{3-113}$$

The direct substitution of these relations in the defining equations of the
entropies leads to the following basic identities:

$$H(X,Y) = H(X|Y) + H(Y) \tag{3-114}$$

$$H(X,Y) = H(Y|X) + H(X) \tag{3-115}$$

Next we should like to establish a fundamental inequality first shown
by Shannon, namely,

$$H(X) \ge H(X|Y) \tag{3-116}$$

For the proof of this inequality, we employ once again Eq. (3-50) for
$\log\,(p\{x_k\}/p\{x_k|y_j\})$.

$$H(X|Y) - H(X) = \sum_{j=1}^{m}\sum_{k=1}^{n} p\{x_k,y_j\}\log\frac{p\{x_k\}}{p\{x_k|y_j\}} \le \sum_{j=1}^{m}\sum_{k=1}^{n} p\{x_k,y_j\}\left(\frac{p\{x_k\}}{p\{x_k|y_j\}} - 1\right)\log e \tag{3-117}$$

But the right side of this inequality is identically zero as

$$\sum_{j=1}^{m}\sum_{k=1}^{n} \left(p\{x_k\}\,p\{y_j\} - p\{x_k,y_j\}\right)\log e = \sum_{j=1}^{m} \left(p\{y_j\} - p\{y_j\}\right)\log e = 0 \tag{3-118}$$

Hence,

$$H(X) \ge H(X|Y) \tag{3-119}$$

and similarly one shows that

$$H(Y) \ge H(Y|X) \tag{3-120}$$

The equality signs hold if, and only if, X and Y are statistically independent.
It is only in such a case that our key inequality Eq. (3-50)
becomes an equality (at point x = 1), that is,

$$\frac{p\{x_k\}}{p\{x_k|y_j\}} = 1 \tag{3-121}$$

for all permissible values of k and j. This is the case of independence
between X and Y.

Fig. E3-4


Example 3-4. A transmitter has an alphabet consisting of five letters
$\{x_1,x_2,x_3,x_4,x_5\}$ and the receiver has an alphabet of four
letters $\{y_1,y_2,y_3,y_4\}$. The joint probabilities for the communication
are given below. See Fig. E3-4.

          y1     y2     y3     y4
    x1   0.25    0      0      0
    x2   0.10   0.30    0      0
    x3    0     0.05   0.10    0
    x4    0      0     0.05   0.10
    x5    0      0     0.05    0

Determine the different entropies for this channel.



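Solution

$$\begin{aligned}
f_1(x_1) &= 0.25 & f_2(y_1) &= 0.25 + 0.10 = 0.35 \\
f_1(x_2) &= 0.10 + 0.30 = 0.40 & f_2(y_2) &= 0.30 + 0.05 = 0.35 \\
f_1(x_3) &= 0.05 + 0.10 = 0.15 & f_2(y_3) &= 0.10 + 0.05 + 0.05 = 0.20 \\
f_1(x_4) &= 0.05 + 0.10 = 0.15 & f_2(y_4) &= 0.10 \\
f_1(x_5) &= 0.05 & &
\end{aligned}$$

$$f(x|y) = \frac{f(x,y)}{f_2(y)}$$

$$\begin{aligned}
f(x_1|y_1) &= \frac{0.25}{0.35} = \frac{5}{7} & f(x_2|y_1) &= \frac{0.10}{0.35} = \frac{2}{7} \\
f(x_2|y_2) &= \frac{0.30}{0.35} = \frac{6}{7} & f(x_3|y_2) &= \frac{0.05}{0.35} = \frac{1}{7} \\
f(x_3|y_3) &= \frac{0.10}{0.20} = \frac{1}{2} & f(x_4|y_3) &= \frac{0.05}{0.20} = \frac{1}{4} \\
f(x_4|y_4) &= \frac{0.10}{0.10} = 1 & f(x_5|y_3) &= \frac{0.05}{0.20} = \frac{1}{4}
\end{aligned}$$

$$f(y|x) = \frac{f(x,y)}{f_1(x)}$$

$$\begin{aligned}
f(y_1|x_1) &= \frac{0.25}{0.25} = 1 & f(y_1|x_2) &= \frac{0.10}{0.40} = \frac{1}{4} \\
f(y_2|x_2) &= \frac{0.30}{0.40} = \frac{3}{4} & f(y_2|x_3) &= \frac{0.05}{0.15} = \frac{1}{3} \\
f(y_3|x_3) &= \frac{0.10}{0.15} = \frac{2}{3} & f(y_3|x_4) &= \frac{0.05}{0.15} = \frac{1}{3} \\
f(y_4|x_4) &= \frac{0.10}{0.15} = \frac{2}{3} & f(y_3|x_5) &= \frac{0.05}{0.05} = 1
\end{aligned}$$

$$\begin{aligned}
H(X,Y) &= -\sum_x\sum_y f(x,y)\log f(x,y) \\
&= -0.25\log 0.25 - 0.10\log 0.10 - 0.30\log 0.30 - 0.05\log 0.05 \\
&\quad - 0.10\log 0.10 - 0.05\log 0.05 - 0.10\log 0.10 - 0.05\log 0.05 \\
&= 2.665
\end{aligned}$$

$$\begin{aligned}
H(X) &= -\sum_x\sum_y f(x,y)\log f_1(x) \\
&= -0.25\log 0.25 - 0.10\log 0.40 - 0.30\log 0.40 - 0.05\log 0.15 \\
&\quad - 0.10\log 0.15 - 0.05\log 0.15 - 0.10\log 0.15 - 0.05\log 0.05 \\
&= 2.066
\end{aligned}$$

$$\begin{aligned}
H(Y) &= -\sum_x\sum_y f(x,y)\log f_2(y) \\
&= -0.25\log 0.35 - 0.10\log 0.35 - 0.30\log 0.35 - 0.05\log 0.35 \\
&\quad - 0.10\log 0.20 - 0.05\log 0.20 - 0.05\log 0.20 - 0.10\log 0.10 \\
&= 1.856
\end{aligned}$$

$$\begin{aligned}
H(Y|X) &= -\sum_x\sum_y f(x,y)\log f(y|x) \\
&= -0.10\log\tfrac{1}{4} - 0.30\log\tfrac{3}{4} - 0.05\log\tfrac{1}{3} \\
&\quad - 0.10\log\tfrac{2}{3} - 0.05\log\tfrac{1}{3} - 0.10\log\tfrac{2}{3} \\
&= 0.600
\end{aligned}$$

$$\begin{aligned}
H(X|Y) &= -\sum_x\sum_y f(x,y)\log f(x|y) \\
&= -0.25\log\tfrac{5}{7} - 0.10\log\tfrac{2}{7} - 0.30\log\tfrac{6}{7} - 0.05\log\tfrac{1}{7} \\
&\quad - 0.10\log\tfrac{1}{2} - 0.05\log\tfrac{1}{4} - 0.05\log\tfrac{1}{4} \\
&= 0.809
\end{aligned}$$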



Note that

$$H(X,Y) < H(X) + H(Y)$$

$$2.665 < 2.066 + 1.856$$

and

$$H(X,Y) = H(Y) + H(X|Y) = H(X) + H(Y|X)$$

$$2.665 = 1.856 + 0.809 = 2.066 + 0.600$$
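The figures of this example can be reproduced with a short Python sketch. The function name is illustrative; the joint matrix is the one given in the example, with rows for the five transmitted letters and columns for the four received letters.

```python
import math

def five_entropies(joint):
    """Five entropies (bits) of joint[i][j] = f(x_i, y_j), from the defining sums."""
    fx = [sum(row) for row in joint]
    fy = [sum(col) for col in zip(*joint)]
    cells = [(i, j, p) for i, row in enumerate(joint)
             for j, p in enumerate(row) if p > 0]
    return {
        "H(X,Y)": -sum(p * math.log2(p) for _, _, p in cells),
        "H(X)": -sum(p * math.log2(fx[i]) for i, _, p in cells),
        "H(Y)": -sum(p * math.log2(fy[j]) for _, j, p in cells),
        "H(X|Y)": -sum(p * math.log2(p / fy[j]) for _, j, p in cells),
        "H(Y|X)": -sum(p * math.log2(p / fx[i]) for i, _, p in cells),
    }

joint = [[0.25, 0.00, 0.00, 0.00],
         [0.10, 0.30, 0.00, 0.00],
         [0.00, 0.05, 0.10, 0.00],
         [0.00, 0.00, 0.05, 0.10],
         [0.00, 0.00, 0.05, 0.00]]

entropies = five_entropies(joint)
for name, value in entropies.items():
    print(f"{name} = {value:.3f}")
```

The computed values agree with those quoted in the solution to within a unit in the third decimal place, and they satisfy the two identities noted above.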


3-12. A Measure of Mutual Information. Consider a discrete communication
system with given joint probabilities between its input and
output terminals. Each transmitted symbol $x_i$ while going through the
channel has a certain probability $p\{y_j|x_i\}$ of being received as a
particular symbol $y_j$. In the light of previous developments, one may look
for a function relating a measure of mutual information between $x_i$ and
$y_j$. In other words, how many bits of information do we obtain in knowing
that $y_j$ corresponds to $x_i$ when we know the over-all probability of $x_i$
happening along with different $y_j$? In order to avoid a complex mathematical
presentation, we follow a procedure similar to that of Sec. 3-3. We
assume a definition for mutual information and justify its agreement with
that of the previously adopted definition of the entropy. Finally, we
shall investigate some of the properties of the suggested measure of
mutual information. A measure for the mutual information contained
in $(x_i, y_j)$ can be given as

$$I(x_i;y_j) = \log\frac{p\{x_i|y_j\}}{p\{x_i\}} = \log\frac{p\{x_i,y_j\}}{p\{x_i\}\,p\{y_j\}} \tag{3-122}$$

This expression gives a reasonable measure of mutual information
conveyed by a pair of symbols $(x_i, y_j)$. For a moment, we concentrate
on the received symbol $y_j$. Suppose that an observer is stationed at the
receiver end at the position of the signal $y_j$. His a priori knowledge that
a symbol $x_i$ is being transmitted is the marginal probability $p\{x_i\}$,
that is, the sum of the probabilities of $x_i$ being transmitted and received
as any one of the possible $y_j$. The a posteriori knowledge of our observer is
based on the conditional probability of $x_i$ being transmitted, given that
a particular $y_j$ is received, that is, $p\{x_i|y_j\}$. Therefore, loosely
speaking, for this observer the gain of information is the logarithm of the
ratio of his final and initial ignorance or uncertainties. However, the
mathematically inclined reader may wish to forgo such justification and use
(3-122) as a definition.

The following elementary properties can be derived for the mutual
information function:

1. Continuity. $I(x_i;y_j)$ is a continuous function of $p\{x_i|y_j\}$.

2. Symmetry or reciprocity. The information conveyed by $y_j$ about
$x_i$ is the same as the information conveyed by $x_i$ about $y_j$, that is,

$$I(x_i;y_j) = I(y_j;x_i) \tag{3-123}$$

Obviously, Eq. (3-122) is symmetric with respect to $x_i$ and $y_j$.

3. Mutual and self-information. The function $I(x_i;x_i)$ may be called
the self-information of a symbol $x_i$. That is, if an observer is stationed at
the position of the symbol $x_i$ his a priori knowledge of the situation is
that $x_i$ will be transmitted with the probability $p\{x_i\}$ and his a
posteriori knowledge is the certainty that $x_i$ has been transmitted; thus

$$I(x_i) = I(x_i;x_i) = \log\frac{1}{p\{x_i\}} \tag{3-124}$$

Obviously,

$$I(x_i;y_j) \le I(x_i;x_i) = I(x_i) \tag{3-125}$$

$$I(x_i;y_j) \le I(y_j;y_j) = I(y_j) \tag{3-126}$$

An interesting interpretation of the concept of mutual information can
be given by obtaining the average of the mutual information per symbol
pairs, that is,

$$I(X;Y) = \overline{I(x_i;y_j)} = \sum_{j}\sum_{i} p\{x_i,y_j\}\,I(x_i;y_j) \tag{3-127}$$

$$I(X;Y) = \sum_{j}\sum_{i} p\{x_i,y_j\}\log\frac{p\{x_i|y_j\}}{p\{x_i\}} \tag{3-128}$$


It could be ascertained that this definition provides a proper measure 
for the mutual information of all the pairs of symbols. On the other 
hand, the definition ties in with our previously defined basic entropy 
formulas. Indeed, by direct application of the defining equations one 
can show that 

$$I(X;Y) = H(X) + H(Y) - H(X,Y) \tag{3-129}$$

$$I(X;Y) = H(X) - H(X|Y) \tag{3-130}$$

$$I(X;Y) = H(Y) - H(Y|X) \tag{3-131}$$

The entropy corresponding to the mutual information, that is, I(X;Y),
indicates a measure of the information transmitted through the channel.
For this reason it is referred to as transferred information or
transinformation of the channel. Note that, based on the fundamental equation
(3-116), the right side of Eq. (3-130) is a nonnegative number. Hence,
the average mutual information is also nonnegative, while the individual
mutual-information quantities may become negative for some symbol
pairs. For a noise-free channel,

$$I(X;Y) = H(X) = H(Y) \tag{3-132}$$

$$I(X;Y) = H(X,Y) \tag{3-133}$$

For a channel where the output and the input symbols are independent,

$$I(X;Y) = H(X) - H(X|Y) = H(X) - H(X) = 0 \tag{3-134}$$

No information is transmitted through the channel.
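The remark that individual mutual-information quantities may be negative while their average is not can be illustrated with a small Python sketch. The joint matrix below is an assumed example of a binary channel with some noise; the function name is illustrative.

```python
import math

def mutual_information(joint):
    """Return (pointwise, average) mutual information in bits.

    pointwise[i][j] = log2( p{x_i,y_j} / (p{x_i} p{y_j}) ), Eq. (3-122);
    entries with zero joint probability are reported as None.
    The average follows Eq. (3-127).
    """
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    pointwise = [[math.log2(p / (px[i] * py[j])) if p > 0 else None
                  for j, p in enumerate(row)]
                 for i, row in enumerate(joint)]
    average = sum(p * pointwise[i][j]
                  for i, row in enumerate(joint)
                  for j, p in enumerate(row) if p > 0)
    return pointwise, average

# Assumed joint matrix: each symbol is usually, but not always, received correctly.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

pointwise, average = mutual_information(joint)
print("I(x_i; y_j):", [[round(v, 4) for v in row] for row in pointwise])
print(f"I(X;Y) = {average:.4f} bits")
```

Here the off-diagonal pairs carry negative mutual information, yet the average I(X;Y) is positive, as Eq. (3-130) requires.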



Example 3-5. The joint probability matrix of a channel with binary input and
output is given below:

          y1    y2
    x1   1/4   1/4
    x2   1/4   1/4

Find the different entropies and the mutual information.

Solution. The marginal probabilities are

$$p\{x_i\} = p\{y_j\} = \tfrac{1}{2}$$

The entropies are

$$H(X) = H(Y) = 1$$

$$H(X,Y) = 2$$

$$I(X;Y) = H(X) + H(Y) - H(X,Y) = 0$$

The transinformation is zero, as the input and the output symbols are independent.
In other words, there is no dependence or correlation between the symbols at the
output and the input of the channel.

3-13. Set-theory Interpretation of Shannon's Fundamental Inequalities.
A set-theory interpretation of Shannon's fundamental inequalities
along with the material discussed previously may be illuminating. Consider
the variables X and Y as sets A and B. We may symbolically write m(A)
and m(B) as some kind of measure (say the area) associated with sets
A and B. The entropies of discrete schemes are essentially nonnegative,
and they possess the property of Eqs. (3-114) and (3-119). Thus one
may observe that, in a sense, the law of "additivity" of the entropies
holds for disjoint sets. Thus, the following symbolism may be useful in
visualizing the interrelationships (see Figs. 3-9 and 3-10).

Fig. 3-9. A set-theory presentation of a simple communication system. (X,Y)
represents the joint operation of the source and the channel.

Fig. 3-10. A set-theory presentation of different entropies associated with a
simple communication model.


$$\begin{array}{lll}
m(A) & H(X) & \text{(3-135)} \\
m(B) & H(Y) & \text{(3-136)} \\
m(A \cup B) & H(X,Y) & \text{(3-137)} \\
m(AB') & H(X|Y) & \text{(3-138)} \\
m(BA') & H(Y|X) & \text{(3-139)} \\
m(A \cap B) & I(X;Y) & \text{(3-140)} \\
m(A \cup B) \le m(A) + m(B) & H(X,Y) \le H(X) + H(Y) & \text{(3-141)} \\
m(AB') \le m(A) & H(X|Y) \le H(X) & \text{(3-142)} \\
m(BA') \le m(B) & H(Y|X) \le H(Y) & \text{(3-143)} \\
m(A \cup B) = m(AB') + m(BA') + m(A \cap B) & H(X,Y) = H(X|Y) + H(Y|X) + I(X;Y) & \text{(3-144)}
\end{array}$$


When the channel is noise-free, the two sets become "coincident" as
follows:

$$\begin{array}{lll}
m(A) = m(B) & H(X) = H(Y) & \text{(3-145)} \\
m(A \cup B) = m(A) = m(B) & H(X,Y) = H(X) = H(Y) & \text{(3-146)} \\
m(AB') = 0 & H(X|Y) = 0 & \text{(3-147)} \\
m(BA') = 0 & H(Y|X) = 0 & \text{(3-148)} \\
m(A \cap B) = m(A) = m(B) = m(A \cup B) & I(X;Y) = H(X) = H(Y) = H(X,Y) & \text{(3-149)}
\end{array}$$

When the channel is such that input and output symbols are independent,
the two sets A and B are considered mutually exclusive:

$$\begin{array}{lll}
m(A \cup B) = m(A) + m(B) & H(X,Y) = H(X) + H(Y) & \text{(3-150)} \\
m(AB') = m(A) & H(X|Y) = H(X) & \text{(3-151)} \\
m(A \cap B) = 0 & I(X;Y) = 0 & \text{(3-152)}
\end{array}$$

This procedure may be extended to the case of channels with several
ports. For example, for three random variables (X,Y,Z) one may
write

$$H(X,Y,Z) \le H(X) + H(Y) + H(Z) \tag{3-153}$$

$$H(Z|X,Y) \le H(Z|Y) \tag{3-154}$$

See Fig. 3-11. For a formal proof of Eqs. (3-153) and (3-154) see
Khinchin. Similarly, one may give formal proof for the following
interesting equalities:

$$I(X;Y,Z) = I(X;Y) + I(X;Z|Y) \tag{3-155}$$

$$I(Y,Z;X) = I(Y;X) + I(Z;X|Y) \tag{3-156}$$

The set diagrams for these relations are given in Figs. 3-12 and 3-13.

Fig. 3-11. A set-theory presentation of the entropies associated with (X,Y,Z) space.

Fig. 3-12. A set-theory presentation of the entropies associated with . . . space.

Fig. 3-13. A set-theory presentation of the transinformations I(X;Y) and I(X;Z|Y).


In conclusion, it seems worthwhile to present a simple set-theory rule 
for deriving relationships between different entropy functions of discrete 
schemes : 

Draw a set $A_k$ for each random variable $X_k$ of the multidimensional
random variable $(X_1,X_2,\ldots,X_n)$. When two variables $X_k$ and $X_j$ are
independent, their representative sets will be mutually exclusive. Two
random variables $X_j$ and $X_k$ describing a noise-free channel (with a
diagonal probability matrix) will have overlapping set representation.
The following symbolic correspondences are suggested:

$$\begin{array}{lll}
A_kA_j' & H(X_k|X_j) & \text{(3-157)} \\
A_k \cup A_j & H(X_k,X_j) & \text{(3-158)} \\
A_k \cap A_j & I(X_k;X_j) & \text{(3-159)} \\
A_kA_j = \varnothing & I(X_k;X_j) = 0 & \text{(3-160)} \\
A_k = A_j & H(X_k) = H(X_j) = H(X_k,X_j) & \text{(3-161)} \\
A_1 \cap A_2 \cap \cdots \cap A_n & I(X_1;X_2;\ldots;X_n) & \text{(3-162)} \\
B \subset C & H(B) \le H(C) & \text{(3-163)}
\end{array}$$


3-14. Redundancy, Efficiency, and Channel Capacity. In Sec. 3-9 we
have presented an interpretation of different entropies in a communication
system. It was also pointed out that the transinformation I(X;Y)
indicates a measure of the average information per symbol transmitted in
the system. The significance of this statement is made clear by referring
to Eq. (3-127). In this section, it is intended to introduce a suitable
measure for efficiency of transmission of information by making a comparison
between the actual rate and the upper bound of the rate of transmission
of information for a given channel. In this respect, Shannon has
introduced the significant concept of channel capacity. According to
Shannon, in a discrete communication system the channel capacity is the
maximum of transinformation.

$$C = \max I(X;Y) = \max\,[H(X) - H(X|Y)] \tag{3-164}$$

The maximization is with respect to all possible sets of probabilities that 
could be assigned to the source alphabet, that is, all discrete memoryless 
sources. Before proceeding with examples of application and computa- 
tion of channel capacity, a somewhat analogous concept from linear net- 
work theory may be worth mentioning. Consider a linear, resistive, 
passive, two-port network connected to a linear resistor R at its output 
terminals. The power dissipated in R under a given regime depends on 
the network and the load. The maximum power dissipated in the load 
occurs when there is a matching between the load and the network, i.e., 
when the resistance of the network seen from the output terminals is 
identical with R. This situation can be further analyzed by observing 
that, for a given network, the power transferred to the load depends on 
the value of the load ; the maximum power transfer occurs only when the 
load and the source are properly matched through a transducer. In a 
discrete communication channel, with prespecified noise characteristics, 
i.e., with a given transition probability matrix, the rate of information 
transmission depends on the source that drives the channel. Note that, 
in the network analogy, one could specify the load and determine the 
class of transducers that would match the given load to a specified class 
of sources. The maximum (or the upper bound) of the rate of informa- 
tion transmission corresponds to a proper matching of the source and 
the channel. This ideal characterization of the source depends in turn 
on the probability transition characteristics of the given channel. 

Discrete Noiseless Channels. The following is an example of the evaluation
of the channel capacity of the simplest type of sources.

Let $X = \{x_i\}$ be the alphabet of a source containing n symbols. Since
the transition probability matrix is of the diagonal type, we have, according
to Eq. (3-132),

$$C = \max I(X;Y) = \max\,[H(X)] = \max\left[-\sum_{i=1}^{n} p\{x_i\}\log p\{x_i\}\right] \tag{3-165}$$

According to Eq. (3-14), the maximum of H(X) occurs when all symbols
are equiprobable; thus the channel capacity is

$$C = \log n \quad \text{bits per symbol} \tag{3-166}$$



The capacity of a channel, as well as the rate of transmission of information
through the channel, can be equivalently expressed in bits per second
instead of bits per symbol. For this, one has to introduce the concept of
time required for the transmission of individual symbols. For instance,
if the symbols have a common duration of t seconds, then the channel
capacity per second $C_t$ is given by

$$C_t = \frac{1}{t}\,C \quad \text{bits per second} \tag{3-167}$$

For the simple noise-free communication system described above, we
have

$$C_t = \frac{1}{t}\,C = \frac{1}{t}\log n \quad \text{bits per second} \tag{3-168}$$


The difference between the actual rate of transmission of information
I(X;Y) and its maximum possible value is defined as the (absolute)
redundancy of the communication system. The ratio of absolute redundancy
to channel capacity is defined as the relative redundancy. For the
afore-mentioned system,

$$\text{Absolute redundancy for noise-free channel} = C - I(X;Y) = \log n - H(X) \tag{3-169}$$

$$\text{Relative redundancy for noise-free channel} = \frac{\log n - H(X)}{\log n} = 1 - \frac{H(X)}{\log n} \tag{3-170}$$

The efficiency of the above system can be defined in an obvious fashion as

$$\text{Efficiency of noise-free channel} = \frac{I(X;Y)}{\log n} = \frac{H(X)}{\log n} = 1 - \text{relative redundancy} \tag{3-171}$$
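A short Python sketch of Eqs. (3-166) and (3-169) to (3-171) for a noise-free channel follows. The four-letter source probabilities are an assumed example, and the function name is illustrative.

```python
import math

def noise_free_channel_figures(px):
    """Capacity, redundancies and efficiency of a noise-free channel.

    px holds the source probabilities p{x_i}; for a noise-free channel
    I(X;Y) = H(X), and the capacity is log2(n) bits per symbol.
    """
    n = len(px)
    h_x = -sum(p * math.log2(p) for p in px if p > 0)
    capacity = math.log2(n)                 # Eq. (3-166)
    absolute = capacity - h_x               # Eq. (3-169)
    relative = absolute / capacity          # Eq. (3-170)
    efficiency = h_x / capacity             # Eq. (3-171)
    return capacity, absolute, relative, efficiency

# Assumed four-letter source with unequal probabilities.
capacity, absolute, relative, efficiency = noise_free_channel_figures(
    [0.5, 0.25, 0.125, 0.125])

print(f"C = {capacity} bits per symbol")
print(f"absolute redundancy = {absolute}, relative redundancy = {relative}")
print(f"efficiency = {efficiency}")
```

For this assumed source, H(X) = 1.75 bits against a capacity of 2 bits per symbol, so one-eighth of the capacity is left unused.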


When the time for the transmission of symbols is not necessarily equal,
a similar procedure may be applied. Let t_i be the time associated with
the symbol x_i; then the average transinformation of a noise-free channel
per unit time is

    R_t = \frac{ -\sum_{i=1}^{n} p\{x_i\} \log p\{x_i\} }{ \sum_{i=1}^{n} p\{x_i\} t_i }    (3-172)

R_t is known as the rate of transmission of information. For a given set
of t_i (i = 1, 2, . . . , n), one can evaluate the p{x_i} leading to the
maximum rate of transmission of information per second C_t. The
computation will not be undertaken here.
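
The rate of Eq. (3-172) is easy to evaluate numerically. The sketch below assumes a hypothetical two-symbol source with durations of 2 and 4 time units and compares an equiprobable assignment with a skewed one. The probabilities are illustrative choices, not values from the text.

```python
import math

def rate_per_unit_time(probs, durations):
    """R_t of Eq. (3-172): entropy per symbol divided by mean symbol duration."""
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    mean_duration = sum(p * t for p, t in zip(probs, durations))
    return h / mean_duration

# Hypothetical durations of 2 and 4 time units.
r_equal = rate_per_unit_time([0.5, 0.5], [2, 4])
r_skewed = rate_per_unit_time([0.618, 0.382], [2, 4])
print(r_equal, r_skewed)
```

Favoring the shorter symbol raises the rate per unit time even though it lowers the entropy per symbol, which is why the maximizing probabilities are in general not equal.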





Discrete Noisy Channel. The channel capacity is the maximum of the
average mutual information when the noise characteristic p{y_j|x_i} of the
channel is prespecified.

    C = \max \sum_{i=1}^{n} \sum_{j=1}^{m} p_1\{x_i\} p\{y_j|x_i\} \log \frac{p\{y_j|x_i\}}{p_2\{y_j\}}    (3-173)

where the maximization is with respect to p_1{x_i}. Note that the marginal
probabilities p_2{y_j} are related to the independent variables p_1{x_i}
through the familiar relation

    p_2\{y_j\} = \sum_{i=1}^{n} p_1\{x_i\} p\{y_j|x_i\}    (3-174)

Furthermore, the variables are, of course, restricted by the following
constraints:

    p_1\{x_i\} \ge 0    i = 1, 2, . . . , n
    \sum_{i=1}^{n} p_1\{x_i\} = 1    (3-175)

The maximization of (3-173) with respect to the input probabilities does
not necessarily lead to a set of admissible source-symbol probabilities.
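
The quantity being maximized in Eq. (3-173) can be computed directly for any given input distribution. The sketch below does so, and then estimates the capacity of a two-input channel by a crude grid search over admissible input probabilities. The grid search is only an illustrative device, not a method proposed in the text, and the function names are assumptions.

```python
import math

def mutual_information(p_in, channel):
    """I(X;Y) in bits; channel[i][j] is p{y_j|x_i}, p_in[i] is p_1{x_i}."""
    m = len(channel[0])
    # Output marginals p_2{y_j}, Eq. (3-174).
    p_out = [sum(p_in[i] * channel[i][j] for i in range(len(p_in)))
             for j in range(m)]
    total = 0.0
    for i, p_i in enumerate(p_in):
        for j in range(m):
            p_ji = channel[i][j]
            if p_i > 0 and p_ji > 0:
                total += p_i * p_ji * math.log2(p_ji / p_out[j])
    return total

def capacity_binary_input(channel, steps=10000):
    """Grid search over p_1{x_1}; the constraints (3-175) hold by construction."""
    return max(mutual_information([k / steps, 1 - k / steps], channel)
               for k in range(steps + 1))

# Hypothetical channel: each symbol is received correctly with probability 0.9.
example_channel = [[0.9, 0.1], [0.1, 0.9]]
print(capacity_binary_input(example_channel))
```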

From the physical point of view, the problem of channel capacity is a 
rather complex one. The communication channels are not generally of 
the aforesaid simplest types. When there is an interdependence between 
successive channel symbols, the statistical identification of the source and 
the maximization problem are more cumbersome. In these more general 
cases, the system will exhibit a stochastic nature. Therefore, more 
elaborate techniques need to be introduced for deriving the channel 
capacity of such systems. Because of this complexity, Shannon's
fundamental channel-capacity theorems require adequate preliminary 
preparation. These will be considered in a later chapter. 

3-15. Capacity of Channels with Symmetric Noise Structures. The
computation of the channel capacity in general is a tedious mathematical 
problem, although its formulation is straightforward. The procedure 
of maximization requires some special mathematical techniques such as 
the method of Lagrangian multipliers. In the present section we should 
like to compute the capacity of some special channels with symmetric 
noise characteristics as considered by Shannon. 

Consider a channel such that each input letter is transformed into a
finite number of output letters with a similar set of probabilities for all
the input letters. In this case the channel characteristic matrix
contains identical rows and identical columns but not necessarily in the
same position. See Fig. E3-6, where we have

    \begin{bmatrix} 1/3 & 1/3 & 1/6 & 1/6 \\ 1/6 & 1/6 & 1/3 & 1/3 \end{bmatrix}


For such channels the capacity can be computed without any difficulty.
The key to the simplification is the fact that the conditional entropy
H(Y|X) is independent of the probability distribution at the input.
Indeed, for a letter x_i with marginal probability a_i, we may write

    p\{y_j|x_i\} = \alpha_{ij}    (3-176)

    p\{x_i, y_j\} = a_i \alpha_{ij}    (3-177)

The conditional entropy pertinent to the letter x_i will be

    H(Y|x_i) = -\sum_{j=1}^{m} p\{y_j|x_i\} \log p\{y_j|x_i\}    (3-178)

Now let

    H(Y|x_i) = const = h    for i = 1, 2, . . . , n

Thus, Eq. (3-89) yields

    H(Y|X) = (a_1 + a_2 + \cdots + a_n) h = h    (3-179)

That is, the average conditional entropy is a constant number
independent of the probabilities of the letters at the input of the channel.


Fig. 3-14. A channel with a particularly symmetric structure.

Therefore, instead of maximizing the expression H(Y) - H(Y|X), we
may simply maximize the expression H(Y) - h, or H(Y), as h is a
constant. But the maximum of H(Y) occurs when all the received letters
have the same probabilities, that is,

    C = \log m - h    (3-180)

We may wish to investigate further what restriction Eq. (3-180)
imposes on the channel. For this, reference can be made to the channel
probability matrix and the conditional probability matrix P{X|Y} of
the system. Let p{x_i|y_j} = \beta_{ji} and note that

    p\{y_j\} \beta_{ji} = p\{x_i\} \alpha_{ij}

It can be shown that the conditional probability matrix will also have
identical rows, that is, the tree of all probabilities at the output of the
channel assumes similar symmetry for all sets of the received letters
(Fig. 3-14). Furthermore it can be shown that the probabilities of the
transmitted letters p{x_i} will have to be equal for i = 1, 2, . . . , n.
Conversely, if the situation of Fig. 3-14 prevails, then

    C = \max [H(X) - H(X|Y)]
      = \max [H(X)] - h' = \log n - h'

where h' is the conditional entropy H(X|Y).

Example 3-6. Find the capacity of the channel illustrated in Fig. E3-6.

Fig. E3-6

Solution. Applying Eq. (3-180), one finds

    C = \log 4 - \frac{2}{3} \log 3 - \frac{1}{3} \log 6 \approx 0.082 bit
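
Equation (3-180) reduces the computation to one row entropy. The sketch below assumes that each row of the channel matrix of Fig. E3-6 is a rearrangement of the probabilities 1/3, 1/3, 1/6, 1/6 with four output letters. This is an inference from the printed solution, since the figure itself is not legible in this copy.

```python
import math

def symmetric_capacity(row, n_outputs):
    """C = log m - h, Eq. (3-180), for a channel with a symmetric noise structure.

    `row` holds the transition probabilities of any one input letter; every
    other row is assumed to be a rearrangement of the same numbers.
    """
    h = -sum(p * math.log2(p) for p in row if p > 0)
    return math.log2(n_outputs) - h

# Assumed row probabilities for Fig. E3-6 (inferred, not read from the figure).
c_e36 = symmetric_capacity([1/3, 1/3, 1/6, 1/6], 4)
print(round(c_e36, 4))
```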

Example 3-7. A binary channel has the following noise characteristic:

            0      1
    0  [   2/3    1/3  ]
    1  [   1/3    2/3  ]

(a) If the input symbols are transmitted with respective probabilities of
3/4 and 1/4, find

    H(X), H(Y), H(X|Y), H(Y|X), I(X;Y)

(b) Find the channel capacity and the corresponding input probabilities.

Solution

(a) H(X) = 0.81      H(Y) = 0.98
    H(X|Y) = 0.75    H(Y|X) = 0.92
    I(X;Y) = 0.06

(b) C = 1 + \frac{2}{3} \log \frac{2}{3} + \frac{1}{3} \log \frac{1}{3} = \frac{5}{3} - \log 3 = 0.08
    P{0} = P{1} = 1/2
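
The figures quoted in the solution of Example 3-7 can be checked with a few lines of code. The sketch assumes transition probabilities of 2/3 (correct) and 1/3 (incorrect) and input probabilities of 3/4 and 1/4. These values are inferred from the printed numerical answers, because the fractions are not legible in this copy.

```python
import math

def H(probs):
    """Entropy in bits of a finite probability scheme."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Assumed data for Example 3-7 (inferred from the printed answers).
p_x = [3/4, 1/4]
channel = [[2/3, 1/3], [1/3, 2/3]]

p_y = [sum(p_x[i] * channel[i][j] for i in range(2)) for j in range(2)]
h_x = H(p_x)
h_y = H(p_y)
h_y_given_x = sum(p_x[i] * H(channel[i]) for i in range(2))
i_xy = h_y - h_y_given_x            # I(X;Y) = H(Y) - H(Y|X)
h_x_given_y = h_x - i_xy            # H(X|Y) = H(X) - I(X;Y)
capacity = 1 - H(channel[0])        # symmetric channel, equiprobable inputs

print(h_x, h_y, h_x_given_y, h_y_given_x, i_xy, capacity)
```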





3-16. BSC and BEC. The simplest type of source alphabet to be
considered is binary [0,1]. In this section we assume that the output of
such a source is transmitted via a binary symmetric (BSC) or a binary
erasure (BEC) channel. Figure 3-15 shows a BSC and Fig. 3-16 a BEC.

Fig. 3-15. A binary symmetric channel (BSC).

Fig. 3-16. A binary erasure channel (BEC).

The rate of transmission of information and the capacity will be derived
for each case.

BSC. Let

    P{0} = a    P{1} = 1 - a
    P{0|0} = P{1|1} = p
    P{0|1} = P{1|0} = q = 1 - p

Then

    H(X) = H(a, 1 - a) = -a \log a - (1 - a) \log (1 - a)
    H(Y|X) = -(p \log p + q \log q)
    I(X;Y) = H(Y) - H(Y|X) = H(ap + (1 - a)q, aq + (1 - a)p) + p \log p + q \log q
    C = 1 + p \log p + q \log q    (3-181)

the maximum being reached for a = 1/2.

BEC. The channel has two input symbols (0,1) and three output symbols
(0,y,1). The letter y indicates the fact that the output is erased and no
deterministic decision can be made as to whether the transmitted letter
was 0 or 1. Let

    P{0} = a    P{1} = 1 - a
    P{0|0} = P{1|1} = p
    P{y|0} = P{y|1} = q = 1 - p

Then

    H(X) = H(a, 1 - a)
    H(X|Y) = (1 - p) H(a, 1 - a)
    I(X;Y) = H(X) - H(X|Y) = p H(a, 1 - a)
    C = p    (3-182)

Equations (3-181) and (3-182), specifying the capacity of BSC and BEC,
respectively, will be referred to frequently in the subsequent discussion.
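
The two capacities can be confirmed numerically by scanning the input probability a. The sketch below writes the transinformation of the BSC as H(Y) - H(Y|X) and that of the BEC as p H(a, 1 - a), the standard expressions for these channels. The value p = 0.9 and the grid of 1001 points are arbitrary choices for the illustration.

```python
import math

def H2(a):
    """Binary entropy H(a, 1 - a) in bits."""
    if a <= 0.0 or a >= 1.0:
        return 0.0
    return -a * math.log2(a) - (1 - a) * math.log2(1 - a)

def bsc_information(a, p):
    """I(X;Y) of a BSC with P{0} = a and probability p of correct reception."""
    q = 1 - p
    return H2(a * p + (1 - a) * q) - H2(p)

def bec_information(a, p):
    """I(X;Y) of a BEC with P{0} = a and probability p of no erasure."""
    return p * H2(a)

# Scan a = 0, 0.001, ..., 1 for an assumed p = 0.9.
bsc_capacity = max(bsc_information(k / 1000, 0.9) for k in range(1001))
bec_capacity = max(bec_information(k / 1000, 0.9) for k in range(1001))
print(bsc_capacity, bec_capacity)
```

Both maxima occur at a = 1/2 and agree with Eqs. (3-181) and (3-182).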

Example 3-8. Consider the BSC shown in Fig. E3-8. Assume P{0} = \alpha and that
the successive symbols are transmitted independently. If the channel transmits
all possible binary words U of length 2, which are received as binary words V, derive

(a) The input entropy H(U).

(b) The equivocation entropy H(U|V).

(c) The capacity of the new channel (called the second-order extension of the first
channel).

(d) Generalize the results for the case of transmitting words each n binary digits
long.

Fig. E3-8

Solution. Let U be a random variable encompassing all the binary words 00, 01, 10,
and 11 at the input. Let X_1 and X_2 be random variables referring to symbols in the
first and the second position of each word, respectively. Similarly, let V, Y_1, and Y_2
correspond to the output. Symbolically we may write

    U = X_1, X_2
    V = Y_1, Y_2

Because of lack of memory, the probability distributions are given by

    P{U} = P{X_1} P{X_2}
    P{V} = P{Y_1} P{Y_2}
    P{V|U} = P{Y_1|X_1} P{Y_2|X_2}
    P{U|V} = P{X_1|Y_1} P{X_2|Y_2}
    P{U,V} = P{U} P{V|U} = P{X_1,Y_1} P{X_2,Y_2}

(a) The source entropy H(U) can be thought of as the entropy associated with the
two independent random variables X_1 and X_2. Thus

    H(U) = H(X_1,X_2) = H(X_1) + H(X_2) = -2[\alpha \log \alpha + (1 - \alpha) \log (1 - \alpha)]

since H(X_1) = H(X_2).

(b) H(U,V) = H(X_1,Y_1) + H(X_2,Y_2)
    H(V|U) = H(Y_1|X_1) + H(Y_2|X_2)
    H(U|V) = H(X_1|Y_1) + H(X_2|Y_2)

(c) The transinformation becomes

    I(U;V) = H(U) - H(U|V) = 2 I(X_1;Y_1)

The extended channel capacity is twice the capacity of the original channel.

(d) Similarly, one can show that the capacity of the nth-order extension of the
channel equals nC, where C is the capacity of the original channel. Note that this
statement is independent of the structure of the channel; that is, it holds for any
memoryless channel.
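
The additivity found in part (c) can be verified numerically by building the channel matrix of the second-order extension as a product of single-symbol transition probabilities. In the sketch below, the crossover probability 0.1 and the input probability 0.3 are arbitrary assumed values, and the function names are illustrative.

```python
import math
from itertools import product

def mutual_information(p_in, channel):
    """I(X;Y) in bits; channel[i][j] is the probability of output j given input i."""
    m = len(channel[0])
    p_out = [sum(p_in[i] * channel[i][j] for i in range(len(p_in)))
             for j in range(m)]
    return sum(p_in[i] * channel[i][j] * math.log2(channel[i][j] / p_out[j])
               for i in range(len(p_in)) for j in range(m)
               if p_in[i] > 0 and channel[i][j] > 0)

def second_extension(channel):
    """Channel matrix for two-letter words sent over a memoryless channel."""
    n, m = len(channel), len(channel[0])
    return [[channel[i1][j1] * channel[i2][j2]
             for j1, j2 in product(range(m), repeat=2)]
            for i1, i2 in product(range(n), repeat=2)]

alpha = 0.3                               # assumed P{0}
bsc = [[0.9, 0.1], [0.1, 0.9]]            # assumed BSC
p_x = [alpha, 1 - alpha]
p_u = [a * b for a, b in product(p_x, repeat=2)]   # P{U} = P{X_1} P{X_2}

i_single = mutual_information(p_x, bsc)
i_pair = mutual_information(p_u, second_extension(bsc))
print(i_single, i_pair)
```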

3-17. Capacity of Binary Channels. Binary channels are of considerable
interest in the transmission and storage of information. The vast
field of digital computers offers many examples of such information
channels. The problem undertaken in this section is the evaluation of
the maximum rate of transmission of information of binary channels.

The source transmits independently two symbols, say 1 and 0, with
respective probabilities p_1 and p_2. The channel characteristic is known
as (see Fig. 3-17)

    \begin{bmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{bmatrix}

In order to evaluate the capacity of such a channel, when the entropy
curve is available a simple geometric procedure can be devised (see
Fig. 3-18).

The points A_1 and A_2 on the segment OM are selected so that

    MA_1 = p_{11}    OA_2 = p_{22}

The ordinates of the entropy curve at A_1 and A_2 are

    A_1B_1 = H(p_{11})    A_2B_2 = H(p_{22})

Now, for any given channel output probabilities such as OA = p and
MA = 1 - p, the transinformation can be geometrically identified.
In fact,

    I(X;Y) = H(Y) - H(Y|X)
    I(X;Y) = H(p) - p_1 H(p_{11}) - p_2 H(p_{22})
    I(X;Y) = BA - FA

Fig. 3-18. A geometric determination of different entropies, transinformation, and
channel capacity of a BSC.

Of course, the point A corresponding to the desired mode of operation is
not known. A glance at Fig. 3-18 suggests that the largest value of
transinformation is obtained when the probabilities at the receiving end
are represented by the point at which the tangent of the entropy curve is
parallel to B_1B_2. At that point the vertical segment representing the
transinformation assumes its largest value.
The corresponding source probabilities can be derived in a direct manner.
C. E. Shannon has generalized this procedure to 3 × 3 and more complex
channels.* His procedure is based on the use of a barycentric coordinate
system. For complex channels, however, an analytic approach is often
more desirable than a geometric procedure.

The following method for evaluation of the channel capacity has been
suggested by S. Muroga. First one introduces auxiliary variables Q_1
and Q_2 which satisfy the following equations:

    p_{11} Q_1 + p_{12} Q_2 = p_{11} \log p_{11} + p_{12} \log p_{12}
    p_{21} Q_1 + p_{22} Q_2 = p_{21} \log p_{21} + p_{22} \log p_{22}    (3-183)

The rate of transmission of information I(X;Y) can be written as

    I(X;Y) = H(Y) - H(Y|X) = -(p_1' \log p_1' + p_2' \log p_2')
             + p_1 (p_{11} \log p_{11} + p_{12} \log p_{12}) + p_2 (p_{21} \log p_{21} + p_{22} \log p_{22})    (3-184)

where p_1' and p_2' are the probabilities of receiving 1 and 0 at the output
port, respectively. Next, we introduce Q_1 and Q_2 into Eq. (3-184),
through Eq. (3-183):

    I(X;Y) = -(p_1' \log p_1' + p_2' \log p_2') + (p_1 p_{11} + p_2 p_{21}) Q_1
             + (p_1 p_{12} + p_2 p_{22}) Q_2

Thus,    I(X;Y) = -(p_1' \log p_1' + p_2' \log p_2') + p_1' Q_1 + p_2' Q_2

The maximization of I(X;Y) is now done with respect to p_1' and p_2',
the probabilities at the output. In order to do this, we may use the
method of Lagrangian multipliers. This method suggests maximizing
the function

    V = -(p_1' \log p_1' + p_2' \log p_2') + p_1' Q_1 + p_2' Q_2 + \mu (p_1' + p_2')    (3-185)

through a proper selection of the constant number \mu. Therefore one
requires

    \frac{\partial V}{\partial p_1'} = -(\log e + \log p_1') + Q_1 + \mu = 0
    \frac{\partial V}{\partial p_2'} = -(\log e + \log p_2') + Q_2 + \mu = 0    (3-186)

The simultaneous validity of these equations requires that

    \mu = -Q_1 + (\log e + \log p_1') = -Q_2 + (\log e + \log p_2')    (3-187)

* C. E. Shannon, Geometrische Deutung einiger Ergebnisse bei der Berechnung der
Kanalkapazität, NTZ-Nachrtech. Z., vol. 10, no. 1, pp. 1-4, 1957.



Fig. 3-19. A chart for determining values of Q_1 in terms of p_{11} and p_{22} for binary
channels. The corresponding value of Q_2 is obtained by an interchange of p_{11} and p_{22}.

Fig. 3-20. Capacity of a binary channel in terms of p_{11} and p_{22}.


The channel capacity is found to be

    C = \max [I(X;Y)] = Q_1 - \log p_1' = Q_2 - \log p_2'

The values of Q_1 and Q_2 may be obtained from the set of Eqs. (3-183).
But note that

    p_i' = \exp (Q_i - C)    i = 1, 2    (3-188)

where exp denotes a power of 2, the logarithms being taken to the base 2.
Thus

    C = \log [\exp (Q_1) + \exp (Q_2)]
    C = \log \sum_{i=1}^{2} \exp (Q_i) = \log (2^{Q_1} + 2^{Q_2})    (3-189)

A similar result was obtained earlier in a different way by Shannon.
Later on Silverman and Chang derived further interesting
results. The chart of Fig. 3-19 gives the value of Q_1 and Q_2 for a binary
channel. The chart of Fig. 3-20 gives the corresponding capacities
(IRE Trans. on Inform. Theory, vol. IT-4, p. 153, December, 1958).
Note that the capacity of a binary channel is greater than zero except
when

    p_{11} + p_{22} = 1

Example 3-9. Find the capacity of the following three binary channels, first
directly and then from the graphs of Figs. 3-19 and 3-20:

(a) p_{11} = p_{22} = 1

(b) p_{11} = p_{12} = p_{21} = p_{22} = 1/2

(c) p_{11} = p_{12} = 1/2, p_{21} = 1/4, p_{22} = 3/4

Solution

(a) p_{11} = p_{22} = 1
    p_{12} = p_{21} = 0

Direct computation yields

    Q_1 = Q_2 = 0
    C = \log (2^{Q_1} + 2^{Q_2}) = 1 bit

This channel capacity is achieved when the input symbols are equiprobable.

(b) p_{11} = p_{12} = p_{21} = p_{22} = 1/2

The noise matrix is singular and leads to

    Q_1 + Q_2 = -2
    Q_1 = Q_2 = -1
    C = \log (2^{-1} + 2^{-1}) = 0

Any input probability distribution will lead to zero transinformation as the input and
the output are independent. This result can be verified by checking with Figs. 3-19
and 3-20.

(c) p_{11} = p_{12} = 1/2
    p_{21} = 1/4    p_{22} = 3/4

    \begin{bmatrix} 1/2 & 1/2 \\ 1/4 & 3/4 \end{bmatrix} \begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix}
        = \begin{bmatrix} -1 \\ \frac{1}{4} \log \frac{1}{4} + \frac{3}{4} \log \frac{3}{4} \end{bmatrix}
        = \begin{bmatrix} -1 \\ -2 + \frac{3}{4} \log 3 \end{bmatrix}

    \begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix} = \begin{bmatrix} 1 - \frac{3}{2} \log 3 \\ -3 + \frac{3}{2} \log 3 \end{bmatrix}
        = \begin{bmatrix} -1.378 \\ -0.622 \end{bmatrix}

    C = \log (2^{-1.378} + 2^{-0.622}) = \log 1.0345 = 0.049 bit

This answer can be verified from the graph of Fig. 3-20.
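
The binary case of Muroga's method amounts to solving two linear equations and evaluating Eq. (3-189). The sketch below uses Cramer's rule and treats a singular noise matrix separately. The function name and the tolerance are illustrative assumptions.

```python
import math

def muroga_binary(p11, p22):
    """Capacity in bits of a binary channel from the auxiliary variables Q_1, Q_2."""
    p12, p21 = 1 - p11, 1 - p22

    def xlogx(p):
        return p * math.log2(p) if p > 0 else 0.0

    b1 = xlogx(p11) + xlogx(p12)        # right-hand sides of Eq. (3-183)
    b2 = xlogx(p21) + xlogx(p22)
    det = p11 * p22 - p12 * p21
    if abs(det) < 1e-12:
        return 0.0                      # identical rows (p11 + p22 = 1): C = 0
    q1 = (b1 * p22 - p12 * b2) / det    # Cramer's rule
    q2 = (p11 * b2 - p21 * b1) / det
    return math.log2(2 ** q1 + 2 ** q2) # Eq. (3-189)

print(muroga_binary(1.0, 1.0), muroga_binary(0.5, 0.5), muroga_binary(0.5, 0.75))
```

The three calls correspond to cases (a), (b), and (c) of Example 3-9.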

The generalization of the above method for a channel with an m × m
noise matrix is straightforward. In fact, let

    p_{11} Q_1 + \cdots + p_{1m} Q_m = \sum_{j=1}^{m} p_{1j} \log p_{1j}
    . . . . . . . . . . . . . . . . . . . . . . . . .
    p_{m1} Q_1 + \cdots + p_{mm} Q_m = \sum_{j=1}^{m} p_{mj} \log p_{mj}    (3-190)

and assume that the solution to this set exists. Then the rate of
transmission of information, as before, will become

    I(X;Y) = -\sum_{i=1}^{m} p_i' \log p_i' + \sum_{i=1}^{m} p_i' Q_i

The use of the Lagrangian multiplier method will lead to

    C = Q_i - \log p_i'    (3-191)

    C = \log \sum_{i=1}^{m} 2^{Q_i}    (3-192)

It is to be kept in mind that the values of C thus obtained may not
necessarily correspond to a set of realizable input probabilities

    ( 0 \le p_i \le 1,  \sum_i p_i = 1 )

In the latter case the calculation of the channel capacity is more
complicated. Also, the solution to the set [Eq. (3-190)] may not exist, or the
channel matrix may not even be a square matrix. In such cases some
modifications of the above method are suggested in the aforementioned
references. At any rate, although the formulation of the equations
leading to the channel capacity is straightforward, computational
difficulties exist and the present methods are not completely satisfactory.

The capacity of a general binary channel has been computed by
R. A. Silverman, S. Chang, and J. Loeb. A straightforward computation
leads to a closed-form expression C(\alpha,\beta) for the capacity,
where the parameters are \alpha = p_{11} and \beta = p_{22}, and H stands for
the entropy of a binary source. Note that

    C(\alpha,\beta) = C(\beta,\alpha) = C(1 - \alpha, 1 - \beta) = C(1 - \beta, 1 - \alpha)

The input probability P{0} leading to the channel capacity is given by
Silverman and Chang as

    P\{0\} = P(\alpha,\beta) = \beta (\beta - \alpha)^{-1} - (\beta - \alpha)^{-1} [1 + \exp ( \ldots )]^{-1}
    0.37 \approx \frac{1}{e} \le P\{0\} \le 1 - \frac{1}{e} \approx 0.63


The probability of receiving zeros when the channel capacity is achieved 

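Example 3-10. Find the capacity of the channel with the noise matrix as shown
below:

    \begin{bmatrix} 1/2 & 1/4 & 0 & 1/4 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1/4 & 0 & 1/4 & 1/2 \end{bmatrix}

Solution

    Q_1 = Q_4 = -2
    Q_2 = Q_3 = 0
    C = \log (2^{-2} + 2^{0} + 2^{0} + 2^{-2}) = \log 5 - 1 = 1.322 bits

Example 3-11. Determine the capacity of a ternary channel with the stochastic
matrix

    [P] = \begin{bmatrix} \alpha & 1 - \alpha & 0 \\ 1/2 & 0 & 1/2 \\ 0 & 1 - \alpha & \alpha \end{bmatrix}    0 < \alpha < 1

Solution. Since the channel matrix is a square matrix, Eqs. (3-190) yield

    [P][Q] = -\begin{bmatrix} h \\ 1 \\ h \end{bmatrix}    [Q] = \begin{bmatrix} Q_1 \\ Q_2 \\ Q_3 \end{bmatrix}

where h = -\alpha \log \alpha - (1 - \alpha) \log (1 - \alpha)

    [Q] = \begin{bmatrix} -1 \\ (\alpha - h)/(1 - \alpha) \\ -1 \end{bmatrix}

    C = \log (2^{Q_1} + 2^{Q_2} + 2^{Q_3}) = \log \left[ 1 + \exp \frac{\alpha - h}{1 - \alpha} \right]

    [P]^{-1} = \begin{bmatrix} \frac{1}{2\alpha} & 1 & -\frac{1}{2\alpha} \\ \frac{1}{2(1 - \alpha)} & -\frac{\alpha}{1 - \alpha} & \frac{1}{2(1 - \alpha)} \\ -\frac{1}{2\alpha} & 1 & \frac{1}{2\alpha} \end{bmatrix}

According to the method applied by Muroga, Silverman, and Chang, a direct
computation leads to the following values for the probability of the ith input symbol
achieving channel capacity.

    p_i = 2^{-C} \sum_{t=1}^{m} 2^{Q_t} p^{-1}_{ti}    1 \le i \le m

where p^{-1}_{ti} is the element of the inverse channel matrix [P]^{-1}.

In this example, one finds

    p_1 = p_3 = 2^{-C} \cdot \frac{\exp [(\alpha - h)/(1 - \alpha)]}{2(1 - \alpha)}

    p_2 = \frac{1 + [\alpha/(\alpha - 1)] \exp [(\alpha - h)/(1 - \alpha)]}{1 + \exp [(\alpha - h)/(1 - \alpha)]}

Of course, if we desire to employ this method, the input probabilities must remain
nonnegative. The condition p_2 \ge 0 yields

    \frac{\alpha}{1 - \alpha} \exp \frac{\alpha - h}{1 - \alpha} \le 1

The equality is valid for \alpha \approx 0.641, and the capacity given above is achieved
when \alpha \le 0.641.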
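
For a square, nonsingular channel matrix the whole procedure of Eqs. (3-190) to (3-192) can be mechanized. The sketch below solves the linear system by Gauss-Jordan elimination, then recovers the output probabilities from Eq. (3-191) and the input probabilities from the relation between output and input probabilities. The 4 × 4 matrix used as a check is the one assumed for Example 3-10 (its last row is an inference), and the ternary check uses \alpha = 1/2. Negative entries in the returned input probabilities would signal that the method does not apply.

```python
import math

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    aug = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular channel matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def muroga(channel):
    """Return (C in bits, input probabilities) for a square nonsingular matrix."""
    rhs = [sum(p * math.log2(p) for p in row if p > 0) for row in channel]
    q = solve(channel, rhs)                          # Eq. (3-190)
    capacity = math.log2(sum(2 ** qi for qi in q))   # Eq. (3-192)
    p_out = [2 ** (qi - capacity) for qi in q]       # from Eq. (3-191)
    transposed = [list(col) for col in zip(*channel)]
    p_in = solve(transposed, p_out)                  # p_out[j] = sum_i p_in[i] P[i][j]
    return capacity, p_in

c4, p4 = muroga([[1/2, 1/4, 0, 1/4], [0, 1, 0, 0], [0, 0, 1, 0], [1/4, 0, 1/4, 1/2]])
c3, p3 = muroga([[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.5]])
print(c4, p4)
print(c3, p3)
```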


3-18. Binary Pulse Width Communication Channel. In many practical
problems, there is a time (or cost) associated with the transmission
of each letter of the alphabet. In such circumstances, it is desired to
investigate the rate of transmission of information and the capacity of
the channel in the absence of noise. To obtain the rate of transmission of
information, we assume independence of successive letters and we
consider transmitting all words of duration T with equal probabilities.
Therefore

    R = \frac{\log N(T)}{T}    (3-193)

where N(T) is the number of distinct words of duration T. Shannon has
defined the capacity of this noiseless channel as the limit of R when very
long messages are considered, i.e.,

    C = \lim_{T \to \infty} \frac{\log N(T)}{T}    (3-194)

Thus, the problem of calculation of the capacity for such communication
channels, under the above assumptions, is reduced to a combinatorial
problem, that is, computing N(T).

Let the alphabet be [a_1, a_2, . . . , a_n] and the associated durations
[t_1, t_2, . . . , t_n]. The number of distinct words of duration T is given
by

    N(T) = \sum_{k=1}^{n} N(T - t_k)    (3-195)

This is a difference equation. The general solution of this equation is

    N(T) = \sum_{k=1}^{n} A_k r_k^{T}    (3-196)

where the r_k's are roots of the characteristic equation

    1 - \sum_{k=1}^{n} r^{-t_k} = 1 - f(r) = 0    (3-197)

The constants A_1, A_2, . . . , A_k depend on the boundary conditions
of the problem. At the moment, we are interested in an evaluation of
N(T) for very large values of T. From Eq. (3-197) it is clear that f(r) is a
monotonic decreasing function of r for r > 0;

    f(0) = \infty    f(\infty) = 0

Therefore, the equation f(r) = 1 cannot have more than one positive
real root. Hence for large values of T, the function N(T) behaves as
A_1 r_1^T, where r_1 is the positive root of the characteristic equation (3-197).
In the absence of additional constraint, the channel capacity becomes

    C = \lim_{T \to \infty} \frac{\log (A_1 r_1^{T})}{T}

    C = \log r_1    (3-198)


Example 3-12. Consider an alphabet consisting of two rectangular pulses of equal
heights. The durations of the pulses are 2 and 4 time units. Find the capacity of a
noiseless channel transmitting very long messages. (The messages are fed to the
channel with equal probability and independently.)

Solution. The difference equation to be solved is

    N(T) = N(T - 2) + N(T - 4)

Assuming N(-3) = N(-2) = N(-1) = 0, straightforward computation yields

     T     N(T)    log N(T)    (1/T) log N(T)
     2       1      0           0
     3       1      0           0
     4       2      1           0.250
     5       2      1           0.200
     6       3      1.585       0.264
     7       3      1.585       0.226
     8       5      2.322       0.290
     9       5      2.322       0.258
    10       8      3.000       0.300
    15      21      4.392       0.292
    19      55      5.781       0.304

Comparison between successive increments for values about T = 19 to T = 23 shows
that (1/T) log N(T) approaches the value C = 0.342. We may alternatively employ
Eq. (3-197):

    1 - f(r) = 1 - r^{-2} - r^{-4} = 0

The only positive root of the characteristic equation is r_1 = (0.618)^{-1/2}.

    C = -\frac{1}{2} \log 0.618 = 0.347
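
Both routes to the capacity in Example 3-12 can be reproduced numerically. The sketch below generates N(T) from the difference equation and finds the positive root of the characteristic equation by bisection. The starting values N(0) = N(1) = 1 are an assumption chosen to reproduce the tabulated counts, and the bisection bracket presumes at least two symbols with durations of one time unit or more.

```python
import math

def count_words(durations, t_max):
    """N(T) for T = 0..t_max from N(T) = sum_k N(T - t_k), Eq. (3-195).

    N(T) is taken as 1 for 0 <= T < min(durations) and 0 for T < 0
    (assumed boundary conditions).
    """
    t_min = min(durations)
    n = {}
    for t in range(t_max + 1):
        if t < t_min:
            n[t] = 1
        else:
            n[t] = sum(n[t - d] for d in durations if t - d >= 0)
    return n

def pulse_capacity(durations, lo=1.0 + 1e-9, hi=16.0, iterations=200):
    """log2 of the positive root of sum_k r**(-t_k) = 1, Eqs. (3-197) and (3-198)."""
    def f(r):
        return sum(r ** (-d) for d in durations)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 1:      # f is decreasing, so the root lies above mid
            lo = mid
        else:
            hi = mid
    return math.log2((lo + hi) / 2)

table = count_words([2, 4], 23)
c_root = pulse_capacity([2, 4])
print(table[19], math.log2(table[19]) / 19, c_root)
```

The root gives C of about 0.347 bit per time unit, while (1/T) log N(T) is still near 0.30 at T = 19, so the convergence of the direct count is slow.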

3-19. Uniqueness of the Entropy Function. We have adopted the
logarithmic form for the communication entropy as the most suitable
form satisfying certain specified requirements. In this section we wish
to prove formally that, if a few specified requirements are to be fulfilled,
the logarithmic form is the unique function satisfying these constraints.
The complexity of the proof depends on the type of constraints imposed.
The following requirements for an entropy function seem to be reasonable.

1. Given a finite complete probability scheme [p_1, p_2, . . . , p_n], the
associated entropy function H(p_1, p_2, . . . , p_n) must take on its largest
value when all events are equiprobable.

2. For a joint, finite complete scheme the associated entropies should
satisfy the identity

    H(X,Y) = H(X) + H(Y|X)

The average information conveyed by (X,Y) is the sum of the average
information given by X and that provided by Y when X is given.

3. Adding an impossible event to a scheme should not change the
entropy of the scheme.

    H(p_1, p_2, . . . , p_n, 0) = H(p_1, p_2, . . . , p_n)

4. The entropy function is continuous with respect to all its arguments.

Theorem. Let H(p_1, p_2, . . . , p_n) be a function satisfying
requirements 1, 2, 3, and 4 above for any values of p_k (k = 1, 2, . . . , n);
then

    H(p_1, p_2, . . . , p_n) = -\lambda \sum_{i=1}^{n} p_i \log p_i    \lambda > 0

Proof. Let

    H\left( \frac{1}{n}, \frac{1}{n}, . . . , \frac{1}{n} \right) = f(n)    (3-199)

The first step in the proof is to show that f(n) = \lambda \log n. In fact,

    f(n) = H\left( \frac{1}{n}, . . . , \frac{1}{n} \right) = H\left( \frac{1}{n}, . . . , \frac{1}{n}, 0 \right)
         \le H\left( \frac{1}{n+1}, . . . , \frac{1}{n+1} \right) = f(n + 1)    (3-200)

Thus, the desired f(n) is a nondecreasing function of n. Note that
according to requirement 2, for any complete probability scheme
consisting of the sum of m mutually independent schemes, we can write

    H(X_1, X_2, . . . , X_m) = \sum_{k=1}^{m} H(X_k)    (3-201)

If each scheme consists of r equally likely events, we have

    H(X_1, X_2, . . . , X_m) = m f(r) = f(r^m)    (3-202)

m and r being any arbitrary positive integers. Now we choose integers
t and n such that

        r^m \le t^n \le r^{m+1}
    or  m \log r \le n \log t \le (m + 1) \log r    (3-203)
        \frac{m}{n} \le \frac{\log t}{\log r} \le \frac{m + 1}{n}

From the nondecreasing property of f(n) we conclude that

    f(r^m) \le f(t^n) \le f(r^{m+1})
    m f(r) \le n f(t) \le (m + 1) f(r)
    \frac{m}{n} \le \frac{f(t)}{f(r)} \le \frac{m + 1}{n}    (3-204)

Comparison between Eqs. (3-204) and (3-203) yields

    \left| \frac{f(t)}{f(r)} - \frac{\log t}{\log r} \right| \le \frac{1}{n}    (3-205)

Since n can be chosen arbitrarily large, for any positive integers r and
t, we have

        \frac{f(t)}{\log t} = \frac{f(r)}{\log r}
    or  f(t) = \lambda \log t    (3-206)

The nondecreasing property of f(t) requires that \lambda be a positive constant.
Thus we have proved the uniqueness theorem for the special case when
all events are equiprobable. Next, we consider the case where all the
probabilities are required to be rational numbers but not necessarily all
equal. Let \alpha be a common denominator for the different rational p_k and
let

    p_k = \frac{\alpha_k}{\alpha}    \sum_{k=1}^{n} \alpha_k = \alpha    \alpha_k > 0    k = 1, 2, . . . , n    (3-207)

In order to define the entropy of this scheme (X), we shall transfer the
problem to the previously discussed case. For this, consider a probability
scheme Y depending on X. Let the scheme Z consist of \alpha equally likely
events [z_1, z_2, . . . , z_\alpha]. For convenience we partition these events
into n groups containing \alpha_1, \alpha_2, . . . , \alpha_n events, respectively. This
partitioned scheme will be referred to as scheme Y. When the event x_k
with probability p_k occurs, then in scheme Y all events partitioned in the
kth group occur with equal probability. Therefore,

    H(Y|x_k) = H\left( \frac{1}{\alpha_k}, \frac{1}{\alpha_k}, . . . , \frac{1}{\alpha_k} \right) = \lambda \log \alpha_k

    H(Y|X) = \sum_{k=1}^{n} p_k H\left( \frac{1}{\alpha_k}, . . . , \frac{1}{\alpha_k} \right) = \lambda \sum_{k=1}^{n} p_k \log \alpha_k
           = \lambda \sum_{k=1}^{n} p_k \log p_k + \lambda \log \alpha    (3-208)

The totality of events in Z forms the sum of the two schemes:

    H(Z) = H(X,Y) = f(\alpha) = \lambda \log \alpha    (3-209)

But, according to the additivity requirement,

    H(X) = H(X,Y) - H(Y|X)
         = -\lambda \sum_{k=1}^{n} p_k \log p_k    (3-210)

Thus the uniqueness theorem is also valid when the p_k are rational
numbers. Finally the postulate of continuity of the entropy function
guarantees the validity of the theorem when the p_k are incommensurable. The
proof given here is based on Khinchin's elegant presentation of Shannon's
original idea. A more extensive proof based on less restrictive
requirements has been very neatly derived by Fadiev [see Feinstein (I, Chap. 1)].
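
The grouping step of the proof for rational probabilities can be checked numerically. The sketch below takes \lambda = 1 and base-2 logarithms, builds the scheme Z of \alpha equally likely events, and evaluates H(X) = H(X,Y) - H(Y|X) as in Eqs. (3-208) to (3-210). The three probabilities are an arbitrary illustrative choice.

```python
import math
from fractions import Fraction
from functools import reduce

def entropy_by_grouping(probs):
    """H(X) = log(alpha) - sum_k p_k log(alpha_k) for rational p_k = alpha_k / alpha."""
    alpha = reduce(lambda a, b: a * b // math.gcd(a, b),
                   (p.denominator for p in probs))      # common denominator
    counts = [int(p * alpha) for p in probs]            # group sizes alpha_k
    h_z = math.log2(alpha)                              # H(X,Y) = f(alpha), Eq. (3-209)
    h_y_given_x = sum(float(p) * math.log2(c)           # H(Y|X), Eq. (3-208)
                      for p, c in zip(probs, counts) if c > 0)
    return h_z - h_y_given_x                            # Eq. (3-210)

def entropy_direct(probs):
    """The logarithmic form -sum p log p asserted by the theorem."""
    return -sum(float(p) * math.log2(float(p)) for p in probs if p > 0)

example = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
print(entropy_by_grouping(example), entropy_direct(example))
```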


PROBLEMS 


3-1. For a binary channel with

    P{B_1|A_1} = p_1    P{B_2|A_2} = p_2

driven by a source

    P{A_1} = a    P{A_2} = 1 - a

find

(a) The average information rate of the input letters.

(b) The average information rate of the output letters.

(c) The average transinformation.

(d) The results of (a), (6), and (c) when a = pi = p 2 = 

(e) The input probabilities which make the transinformation a maximum for 
pi = Pz =“ H- 

(/) What is the capacity of the channel described in (e) ? 

3-2. Find the capacity of the channel illustrated in Fig. P3-2.

Fig. P3-2




3-3. (a) Compute the transinformation I(X;Y) in the channel in Fig. P3-3,
when symbols A_1 and A_2 are transmitted with respective probabilities \alpha_1 and \alpha_2
(\alpha_1 + \alpha_2 = 1).

(h) Compute part (a) for pi = pz = Je, ai == ai! = '4- 

(c) In part (b) assume that at the receiving station, when B_2 is received, we "decide"
that most likely A_1 was transmitted. This assumption provides us with a new
transinformation I(X;Y) which it is desired to calculate.

3-4. In the channel in Fig. P3-4 the messages [x_1, x_2, x_3] are transmitted with
respective probabilities [\alpha_1, \alpha_2, \alpha_3]. Find the channel capacity.



Fig. P3-4

Fig. P3-5

3-5. A discrete source transmits messages [A_1, A_2, A_3] with respective probabilities
The source is connected to the channel given in Fig. P3-5. Determine

(a) H(X).

(b) H(Y).

(c) I(X;Y).

(d) Channel capacity.



3-6. Same question as in Prob. 3-5 for the channel in Fig. P3-6:

    P{A_1} = 0.6    P{A_2} = 0.3    P{A_3} = 0.1

3-7. Find the capacity of a binary channel with the channel matrices shown.


(a) 


.'/"I 


(/^) 


1 l' 

"1 



L /lo 

; 10 J 



L -10 

^0 1 


3-8. Compute the channel capacity when the channel is specified by the
following matrices:







(a) 

'■'i 

'i o' 


(i>) 


4 

0 ■ 




■‘i '8 




0 

'i 



U' 

0 



0 

^4 

'i 


(c) 

0.8 

0.1 o.r 

(d) 


’8 

>8 

0 ~ 


0.1 

0.8 0.1 


' 8 

■'4 

0 

•h 


0.1 

0.1 0.8_ 



'i 

i'i 

0 






. 0 

0 

?4 

•U- 


3-9. Using the method of Lagrange multipliers, give an alternative proof for Eq.
(3-14), that is, the average information H has its maximum when all events are
equiprobable.

Hint: Determine the constant \lambda such that the function

    H(p_1, p_2, . . . , p_n) + \lambda \sum_{k=1}^{n} p_k

reaches its maximum value.

3-10. The following two finite probability schemes are given: [p_1, p_2, . . . , p_n] and
[q_1, q_2, . . . , q_n]. Show that

    -\sum_{k=1}^{n} p_k \log q_k \ge -\sum_{k=1}^{n} p_k \log p_k

Hint: Let q_k = p_k + r_k and express the above inequality in terms of p_k and r_k.
Finally, apply Eq. (3-50) to the variable x = 1 + r/p.



3-11. It is possible to have the maximum of transinformation for more than one
input probability distribution. The channel in Fig. P3-11 illustrates such a situation.

Find I(X;Y) and the condition for its maximum.

Fig. P3-11

3-12. Same problem as in Example 3-12, but with the duration of the pulses 2 and
5 time units.

3-13. Find the capacity of the memoryless channel specified by the matrix below:

'4 *4 0 

*4 >4 ’l 

0 0 10 

. * 0 0 1 9 

3-14. Under the hypothesis of Sec. 3-18, consider a telegraph channel where the
symbols and their durations are

    dot      2 time units
    dash     4 time units
    space    3 time units

(a) Derive the capacity of the channel for very long equiprobable messages by direct
computation.

(b) Calculate the channel capacity by the described method of solving the
characteristic equation 1 - f(r) = 0.

3-15. In actual pulse-type communication like telegraphy, some additional
constraint should be kept in mind. For example, in ordinary telegraphy we may consider
the alphabet as consisting of four symbols: dot, dash, letter space, and word space.
Two spaces may not be transmitted successively. A relevant diagram is given in
Fig. P3-15 (dot 2 time units, letter space 3 time units, dash 4 time units, and word
space 6 time units).

Extend the calculation of Sec. 3-18 to this case and derive a formula for the channel
capacity under the hypothesis described in Sec. 3-18 [see S. Goldman (Appendix 1)
and L. Brillouin (I, Chap. 4)].

Fig. P3-15



3-16. An independent source transmits messages [x_1, x_2, x_3] with probabilities
[0.4, 0.3, 0.3]. The messages are transmitted over a noisy channel. Knowing the
joint probability matrix for the transmitted-received pairs, find the channel matrix.

             y_1     y_2     y_3
    x_1  [  0.20    0.05    0.05  ]
    x_2  [  0.05    0.25    0.05  ]
    x_3  [  0.05    0.05    0.25  ]

3-17. Find the rate of transmission of three equiprobable messages over the
following channel:

MO 0 0 0' 

0 0 M M 0 0 

_o 0 0 0 M2 M. 

3-18. Determine the capacity of the channel

"M M 0" 

0 

0 0 1 _ 

3-19. Let

    N = \frac{n!}{n_1! n_2! \cdots n_k!}

where n = n_1 + n_2 + \cdots + n_k

Using Stirling's formula, show that for large values of n_1, n_2, . . . , n_k we have

    \log N = -n \sum_{i=1}^{k} \frac{n_i}{n} \log \frac{n_i}{n}

Discuss the connection between this result and the definition of the entropy. [See
Brillouin (I, pp. 7-8).]

3-20. (Advanced Problem.) A discrete random variable X with a specified first
moment m may assume any one of a number n of values with different probabilities.
Using the Lagrange multiplier technique, find the probability distribution that gives
the maximum entropy for the given m.

Hint: The problem is not a very simple one despite its appearance. A solution of
this problem is given by B. S. Fleishman and G. B. Linkovskii, Maximum Entropy
of an Unknown Discrete Distribution with Given First Moment (Radiotekh. i
Elektron., vol. 3, no. 4, pp. 554-556, 1958. English translation available).

3-21. (Advanced Problem.) Let k_1 and k_2 be the stochastic matrices of two given
binary channels. Find the necessary and sufficient condition for their partial
ordering, that is, k_1 \supset k_2.

Hint: Read the section on partial ordering in the Appendix; also see R. A. Silver- 



CHAPTER 4 


ELEMENTS OF ENCODING 


Thus far, coding offers a most significant application of information
theory. The material presented here strongly relies on the content of
Chap. 3. In this chapter, some of the fundamental theorems of information
theory will be introduced. The noiseless encoding theorem and the
fundamental theorem of discrete noiseless memoryless channels will be
given in some detail, with several encoding procedures treated as
applications of these theorems. A heuristic proof for the fundamental theorem
of discrete memoryless channels in the presence of noise will be discussed.
The formal proof for the latter theorem requires some further knowledge
of probability theory beyond the contents of Chap. 3; consequently, it
will be deferred until Chap. 12.


Fig. 4-1. A simplified model of a communication system (a) without encoder-decoder
(source, channel, receiver); (b) with encoder-decoder (source, encoder, channel,
decoder, receiver).

4-1. The Purpose of Encoding. The word encoding, like several other 
common terms of communication engineering, such as detection and 
modulation, is a descriptive word with a broad meaning. It is frequently 
used in a large variety of cases as a transformation procedure operating 
on the input signal prior to its entry into the communication channel, 
the main purpose of coding being, in general, to improve the “efficiency” 
of the communication link in some sense. This definition is, of course, 
unnecessarily broad and vague. In the present work we shall confine 
ourselves to a much more restricted definition which will be described 
later. Consider the basic elements of a communication setup as shown in 
Fig. 4-1a.



By an independent source we mean here a device that selects messages
at random from a discrete message ensemble with prescribed probabilities.

$$[m_1, m_2, \ldots, m_N]$$

$$[p\{m_1\}, p\{m_2\}, \ldots, p\{m_N\}]$$

In the development of this chapter we assume that successive messages
are selected independently; that is, the source has no memory. Later
on in Chap. 11 we shall discuss sources with memory where the selection
of a message is affected by the selection of some of the previously transmitted
messages.

The channel is assumed to be discrete and without memory. Its
behavior is specified by a finite conditional probability matrix, also
referred to as a channel matrix. The channel of communication usually
deals with symbols of some specified list. This list is generally referred
to as the alphabet of the communication language. The following
terminology is suggested for our subsequent work.

Letter, symbol, or character: Any individual member of the alphabet set

Message or word: A finite sequence of letters of the alphabet

Length of a word: The number of letters in a word

Encoding or enciphering: A procedure for associating words constructed
from a finite alphabet of a language with given words of another
language in a one-to-one manner

Decoding or deciphering: The inverse operation of assigning words of the
second language corresponding to given words of the first language

Uniquely decipherable encoding or decoding: The operation in which the
correspondence of all possible sequences of words between the two
languages without space marks between the words is one-to-one

Thus, encoding is a procedure for mapping a given set of messages
$[m_1, m_2, \ldots, m_N]$ onto a new set of encoded messages $[c_1, c_2, \ldots, c_N]$
so that the transformation is one-to-one. Also, generally by encoding we
wish to improve the “efficiency” of the “transmission.” It is, of course,
possible to devise codes for a special purpose (such as secrecy) without
relevance to the transmission efficiency in our adopted sense. It is also
possible to resort to codes which do not have a one-to-one association.
However, our present study will be confined strictly to one-to-one codes
with an eye to improving some sort of “efficiency” of transmission.

If an alphabet set is denoted by

$$A = \{a_1, a_2, \ldots, a_D\}$$

the sequences $a_1a_1a_2$, $a_{10}a_1$, and $a_2a_2a_2a_2$ will be referred to as words of this
language. The lengths of these words are three, two, and four symbols,
respectively. Similarly the set of letters {0,1} constitutes what is
commonly known as the binary alphabet; 001 is a word in the binary
language.

By speaking of more efficient encoding, we agree to refer to encoding
procedures that improve certain “cost functions.” Perhaps the simplest
cost function is obtained when we assign a constant cost figure $t_i$ to each
message $m_i$; $t_i$ can be the duration or any other cost factor. Then the
average cost per message becomes

$$R_t = \sum_{i=1}^{N} p\{m_i\}\, t_i$$

Obviously, the most efficient transmission is the one that minimizes the
average cost $R_t$. In this chapter we confine ourselves to the simplest case
when all symbols have identical cost. Thus, the average cost per message
becomes proportional to the average of $n_i$, the number of symbols per
message (or the average length of messages L):

$$R_t = L = \sum_{i=1}^{N} p\{m_i\}\, n_i \qquad t_i = n_i \qquad i = 1, 2, \ldots, N \qquad (4\text{-}1a)$$

An increase in transmission efficiency can be obtained by proper
encoding of messages, that is, assigning new sequences of symbols to each
message $m_i$ so that the statistical distribution of the new symbols reduces
the average word length L.
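
As a small numerical illustration of Eq. (4-1a), the following Python sketch compares the average word length of two binary encodings of the same source. The four-message source, the two code books, and the function name `average_length` are hypothetical and chosen only for this illustration.

```python
def average_length(probs, codewords):
    """Average number of code symbols per message, L = sum of p_i * n_i."""
    if len(probs) != len(codewords):
        raise ValueError("need exactly one code word per message")
    return sum(p * len(c) for p, c in zip(probs, codewords))


# Hypothetical four-message source and two candidate binary code books.
probs = [0.5, 0.25, 0.125, 0.125]
fixed = ["00", "01", "10", "11"]        # every word has length 2
variable = ["0", "10", "110", "111"]    # shorter words for likelier messages

L_fixed = average_length(probs, fixed)
L_variable = average_length(probs, variable)
print(L_fixed, L_variable)
```

Under these assumed probabilities the variable-length code book gives the smaller average length, which is the sense in which a proper assignment of words is said to increase transmission efficiency.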

The efficiency of the encoding procedure can be defined if, and only if,
we know the lowest possible bound of L. Of course, if such a lower bound
does not exist, the term efficiency will be meaningless. Thus, an important
question arises here. For a given set of messages and a given alphabet,
what is the lowest possible L that can be obtained? In this chapter
we show that, subject to certain restrictions on the encoding rule, the
lower bound for L is $H(X)/\log D$, where $H(X)$ is the entropy of the
original message ensemble and D the number of symbols in the encoding
alphabet. For the time being, we accept the following definition for
efficiency of an encoding procedure: the ratio of the average information
per symbol of encoded language to the maximum possible average information
per symbol, that is,

$$\text{Efficiency} = \frac{H(X)/L}{\log D} = \frac{H(X)}{L \log D} \qquad (4\text{-}2)$$

If a number of messages are encoded into new words taken from a D-symbol
alphabet, the maximum possible information per symbol supplied
by an independent source will be log D. If the encoded words have an
average length L, then the entropy per symbol is $H(X)/L$. Thus





$[M] = [m_1, m_2, \ldots, m_N]$. Let a memoryless* finite channel be
specified by its conditional probability matrix $[P\{y_j|x_i\}]$ (noise matrix).
Let [D] be a finite alphabet $[a_1, a_2, \ldots, a_D]$. Then an encoding procedure
is a technique for associating a code word $c_k$ consisting of a sequence of
letters from [D] to every $m_k \in [M]$ in a one-to-one manner. We generally
associate a transmission cost with symbols of the new language and search
for encoding techniques that reduce the over-all transmission cost or
achieve more reliable transmission through the channel. Therefore,
encoding is in a sense a means of matching the source and the channel for a
more “efficient” joint operation.

For our purpose, the most important kinds of codes are those that do
not require spacing. That is, if $m_k$ is encoded in $c_k$, then any string

$$[\ldots c_k c_j c_i \ldots]$$

is uniquely decoded as

$$[\ldots m_k m_j m_i \ldots]$$


Codes with this property are referred to as separable codes or uniquely
decipherable codes. Obviously, the common English words are not
separable. For example, if the three separate words “found,” “at,”
and “ion” are transmitted without separation they form a different
word, “foundation,” which may not be implied by the three individual
words. A sufficient condition for unique decipherability is that no
encoded words can be obtained from each other by the addition of more
letters. That is, the code should have the prefix property (also called
irreducibility). It can be seen that irreducibility is a subproperty of
unique decipherability. This is illustrated by examples below:

Fig. 4-2. The diagram points out that irreducible codes are a subclass of the
uniquely decipherable codes.



$$c_1 = 1 \qquad c_2 = 10$$

The two words satisfy unique decipherability but not irreducibility.
The same is true for the words $c_1 = 1$, $c_2 = 10$, $c_3 = 100$. In both
these cases any string of words without spacing is uniquely decipherable,
provided that an appropriate delay time for examining the string as a
whole is allowed. The reader may wish to know when a uniquely decipherable
encoding procedure can be devised. He would also be interested
to know when several such schemes may exist and how one may find an
encoding procedure with high efficiency. The answers to these questions
will be presented in a systematic way. But first a few words on the
significance of encoding are in order. There are several justifications for
including some encoding technique in the present discussion:

1. As yet, encoding techniques seem to provide a most direct application
of information theory.

2. Binary encoding is of great importance in computing machines,
telephones, and automata. Therefore, the particular case of D = 2
offers an interesting practical opportunity for encoding.

3. The two basic central theorems of information theory suggest that,
in the presence or absence of noise, it is possible to approach the limit of
efficiency, that is, to transmit at a rate less than or equal to the channel
capacity with an arbitrarily small error probability. These theorems are
essentially of a mathematical nature. Their proofs are rather tedious.
Coding theory allows us to illustrate the significance of these theorems
without undergoing an otherwise laborious mathematical development.
The content of these theorems will be illustrated by several encoding
techniques.

4. Finally, the mathematical tools required for research in the field of
encoding seem to be commonly available. Thus it seems that further
development in this area will be forthcoming. The reader may find it
worthwhile to become acquainted with the elements of this new field.

* A channel is said to be memoryless if the effect of the noise on the input letters is
independent of the sequence of previously transmitted letters.

4-2. Separable Binary Codes. When separability is the only constraint,
the following simple encoding procedure may be employed.
Divide the message set S into two arbitrary but nonempty subsets $S_1$
and $S_2$.

$$[m_1, m_2, \ldots, m_N]$$

Assign a 0 to all messages in $S_1$ and a 1 to all messages in $S_2$. Now continue
with the partitioning of $S_1$ into subsets $S_{11}$ and $S_{12}$. All messages
in $S_{11}$ will have codes starting with 00, those in $S_{12}$ will have
codes starting with 01, and so on. The partitioning should continue
as long as the subsets contain more than one message. The tree of Fig.
4-3 is an example of this partitioning process.

If the subset, say $S_{121112}$, contains a single message, then the code
010001 is associated with that message. If, say, $S_{21}$ contains a unique
message, the associated code will be 10. It is easy to see that these codes
have the prefix property as no path leading to a vertex can be a subset of a
longer path leading to another vertex. Thus no word is derived by the
addition of digits to shorter words. The partitioning of messages can
be done in a variety of ways. For instance, we may partition one message
at a time, such as

$S_1 = m_1$                          0
$S_2 = [m_2, \ldots, m_N]$           codes starting with 1
$S_{21} = m_2$                       10
$S_{22} = [m_3, \ldots, m_N]$        codes starting with 11
$S_{221} = m_3$                      110
$S_{222} = [m_4, \ldots, m_N]$       codes starting with 111
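
The one-message-at-a-time partitioning listed above can be sketched in a few lines of Python, together with a direct test of the prefix property. Both function names are hypothetical, and the sketch assumes that the last two messages are separated by the final partition, so that they receive words of equal length.

```python
def one_at_a_time_code(n_messages):
    """Binary words obtained by splitting off one message per partition step.

    Message k (k = 1 .. N-1) receives k-1 ones followed by a zero;
    the last message receives N-1 ones.
    """
    if n_messages < 2:
        raise ValueError("need at least two messages to partition")
    words = ["1" * k + "0" for k in range(n_messages - 1)]
    words.append("1" * (n_messages - 1))
    return words


def has_prefix_property(words):
    """True if no code word is the beginning of a different code word."""
    return not any(
        i != j and b.startswith(a)
        for i, a in enumerate(words)
        for j, b in enumerate(words)
    )


words = one_at_a_time_code(5)
print(words, has_prefix_property(words))
```

For comparison, `has_prefix_property(["1", "10"])` is false, since the word 1 is the beginning of the word 10.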

Generally, the efficiency of a code is of special consideration. For this
reason, the partitioning can be more conveniently achieved in the message
probability space. For instance, if we wish the probability of the occurrence
of 0 and 1 in the encoded messages to be not too unequal, it is
logical successively to partition the messages into two more or less equiprobable
subsets. Thus, the problems of separable encoding amount to
devising appropriate partitioning schemes for the message space. This
concept is put in focus in the succeeding sections.

4-3. Shannon-Fano Encoding. This method of encoding is directed 
toward constructing reasonably efficient separable binary codes for 
sources without memory. Let [X] be the ensemble of the messages to 
be transmitted and [P] their corresponding probabilities: 

$$[X] = [x_1, x_2, \ldots, x_n]$$

$$[P] = [P_1, P_2, \ldots, P_n]$$

It is desired to associate a sequence $C_k$ of binary numbers of unspecified
length $n_k$ to each message $x_k$ such that:

1. No sequences of employed binary numbers Ck can be obtained from 
each other by adding more binary terms to the shorter sequence (prefix 
property). 

2. The transmission of the encoded message is “reasonably” efficient,
that is, 1 and 0 appear independently and with (almost) equal
probabilities.

The first constraint eliminates any ambiguity in the receiving end and 
guarantees a one-to-one correspondence between any set of original 
messages and the corresponding set of encoded messages without the 
necessity of spacing between words. (This requirement is called prefix 
constraint.) The second constraint ensures the transmission of almost 
1 bit of information per digit of the encoded messages. It will be shown 
that under favorable circumstances it is possible to have 1 bit of information
per transmitted encoded digit, that is, 1 and 0 may appear with
equal probability. The Shannon-Fano encoding procedure will be
illustrated by the following example:


Messages    Probabilities    Encoded messages    Length

$x_1$       0.2500           00                  2
$x_2$       0.2500           01                  2
$x_3$       0.1250           100                 3
$x_4$       0.1250           101                 3
$x_5$       0.0625           1100                4
$x_6$       0.0625           1101                4
$x_7$       0.0625           1110                4
$x_8$       0.0625           1111                4

Average length                                   2.75


The messages are first written in order of nonincreasing probabilities.
Then the message set is partitioned into two most equiprobable subsets
$\{X_1\}$ and $\{X_2\}$. A 0 is assigned to each message contained in one subset
and a 1 to each of the remaining messages. The same procedure is
repeated for subsets of $\{X_1\}$ and $\{X_2\}$; that is, $\{X_1\}$ will be partitioned
into two subsets $\{X_{11}\}$ and $\{X_{12}\}$. Now the code word corresponding
to a message contained in $X_{11}$ will start with 00 and that corresponding
to a message in $X_{12}$ will begin with 01. This procedure is continued until
each subset contains only one message. Note that each digit 1 or 0 in
each partitioning of the probability space appears with more or less equal
probability, independent of the previous or subsequent partitioning;
therefore the second requirement is also fulfilled. The entropy of the
original source and the average length of the encoded messages (average
number of digits per message) are

$$H = -\left(\tfrac{2}{4} \log \tfrac14 + \tfrac{2}{8} \log \tfrac18 + \tfrac{4}{16} \log \tfrac1{16}\right) = 2\tfrac34 \text{ bits}$$

$$L = \sum P\{x_i\}\, n_i = \tfrac12 \times 2 + \tfrac14 \times 3 + \tfrac14 \times 4 = 2\tfrac34$$

Since each encoded message consists of sequences of independent binary 
digits, the entropy per digit for the encoded messages is 1 bit, that is, 
the efficiency of the transmission of information is 100 per cent. The 
encoding procedure is therefore said to be an optimum procedure mini- 
mizing the average length of messages. No other encoding procedure 
satisfying the above requirements can be found that leads to a smaller 
average number of digits per encoded message. 
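
The successive partitioning described in this section can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the function name `shannon_fano` is hypothetical, and the sketch assumes that each subset is cut at the first position that makes the two parts most nearly equiprobable, a tie-breaking rule the text does not prescribe.

```python
def shannon_fano(probs):
    """Return binary code words (in input order) by successive partitioning."""
    # Indices in order of nonincreasing probability; ties keep input order.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    codes = [""] * len(probs)

    def split(group):
        if len(group) < 2:
            return
        total = sum(probs[i] for i in group)
        running, best_cut, best_gap = 0.0, 1, float("inf")
        for cut in range(1, len(group)):
            running += probs[group[cut - 1]]
            gap = abs(2 * running - total)   # imbalance between the two parts
            if gap < best_gap:
                best_cut, best_gap = cut, gap
        for i in group[:best_cut]:
            codes[i] += "0"
        for i in group[best_cut:]:
            codes[i] += "1"
        split(group[:best_cut])
        split(group[best_cut:])

    split(order)
    return codes


probs = [0.25, 0.25, 0.125, 0.125, 0.0625, 0.0625, 0.0625, 0.0625]
codes = shannon_fano(probs)
avg_len = sum(p * len(c) for p, c in zip(probs, codes))
print(codes, avg_len)
```

For the eight-message ensemble tabulated above, every partition is exactly equiprobable, so the sketch reproduces the listed code words and the average length of 2.75 digits.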

For deriving the most efficient code by this method, it is necessary
that the message probability space can be repeatedly partitioned into
two equiprobable subspaces so that we finally reach the situation where
each message corresponds to only one partitioned subspace. The
probability of the occurrence of each message $x_k$ must be of the form

$$P\{x_k\} = 2^{-n_k}$$




where $n_k$ is a positive integer. The integers $n_k$ satisfy the relation*

$$\sum_{k=1}^{N} 2^{-n_k} = 2^{-n_1} + 2^{-n_2} + \cdots + 2^{-n_N} = 1 \qquad (4\text{-}3)$$

A most particular case of this situation arises when

$$[P] = [2^{-1}, 2^{-2}, \ldots, 2^{-k}, \ldots, 2^{-(N-1)}, 2^{-(N-1)}] \qquad (4\text{-}4)$$

The encoding procedure is unambiguous (one-to-one), the prefix requirement
is fulfilled, and the average number of digits in the encoded messages
is

$$L = \sum_{k=1}^{N} P\{x_k\}\, n_k = -\sum_{k=1}^{N} P\{x_k\} \log P\{x_k\} \qquad (4\text{-}5)$$

For such a message ensemble the average length of the encoded
messages is L binary digits, which is numerically the same as the entropy of the original
message ensemble, that is,

$$H(X) = -\sum_{k=1}^{N} P\{x_k\} \log P\{x_k\} = L \text{ bits per message}$$


The efficiency of the transmission is 100 per cent, and the encoding is thus
an optimum encoding procedure. The tree diagram of Fig. 4-4 illustrates
this encoding procedure.

Fig. 4-4. Example of Shannon-Fano encoding procedure.

For practical purposes, it should be noted that, when the probability
matrix of the message is such that the above successive equiprobable
partitioning is not possible, the Shannon-Fano code may not be an
optimum code. However, if the above requirement is “approximately”
satisfied, then “reasonably” efficient encoding procedures can be
expected. For instance, Fano (I) gives an example for which the efficiency
remains very close to 100 per cent.


Example 4-1. Apply the Shannon-Fano encoding procedure to the following
message ensemble:

$$[X] = [x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9]$$

$$[P] = [0.49, 0.14, 0.14, 0.07, 0.07, 0.04, 0.02, 0.02, 0.01]$$

* It is possible to have several equiprobable messages. In that case, $n_k$ would be
the same for all these messages; that is, if $x_k$ and $x_{k+1}$ are equiprobable and the
Shannon-Fano method strictly applies, then $n_k = n_{k+1}$.



Fig. E4-1. [Panels (a) to (d).]

Solution. Although the probability matrix is not identical with Eq. (4-4), successive
partitioning can be efficiently achieved. See Figs. E4-1a, E4-1b, and E4-1c.

$$H(X) = -[0.49 \log 0.49 + 0.28 \log 0.14 + 0.14 \log 0.07 + 0.04 \log 0.04 + 0.04 \log 0.02 + 0.01 \log 0.01]$$

$$L = 0.49 \times 1 + 0.28 \times 3 + 0.18 \times 4 + 0.02 \times 5 + 0.03 \times 6 = 2.33$$

$$\text{Efficiency} = \frac{H(X)}{2.33}$$

Message    Code

$x_1$      0
$x_2$      100
$x_3$      101
$x_4$      1100
$x_5$      1101
$x_6$      1110
$x_7$      11110
$x_8$      111110
$x_9$      111111

The tree diagram of Fig. E4-1d exhibits the general procedure of the Shannon-Fano
code.
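
The quantities of Example 4-1 can be checked numerically with a short Python sketch. The nine code words below are taken as an assumption from the code table of this example, and logarithms are taken to base 2 so that the entropy is in bits.

```python
import math

# Message probabilities and the assumed Shannon-Fano code words of Example 4-1.
probs = [0.49, 0.14, 0.14, 0.07, 0.07, 0.04, 0.02, 0.02, 0.01]
codes = ["0", "100", "101", "1100", "1101", "1110", "11110", "111110", "111111"]

entropy = -sum(p * math.log2(p) for p in probs)            # H(X), bits per message
avg_len = sum(p * len(c) for p, c in zip(probs, codes))    # L, digits per message
efficiency = entropy / (avg_len * math.log2(2))            # Eq. (4-2) with D = 2

print(f"H(X) = {entropy:.3f}  L = {avg_len:.2f}  efficiency = {efficiency:.3f}")
```

With these inputs the entropy is a little above 2.31 bits, so the efficiency comes out at roughly 99 per cent, even though the probabilities are not exact negative powers of 2.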

Example 4-2. Apply a ternary partitioning technique (similar to the Shannon-
Fano procedure) for encoding the following messages in codes using an alphabet
[0,1,2]. Find L and the code efficiency.






[m]       P{m}      Code

$m_1$     0.375     0
$m_2$     0.167     10
$m_3$     0.125     11
$m_4$     0.125     20
$m_5$     0.125     21
$m_6$     0.083     22


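The probability of the occurrence of 0, 1, and 2 can be directly computed.

$$P\{0\} = \tfrac{16}{39} \qquad P\{1\} = \tfrac{13}{39} \qquad P\{2\} = \tfrac{10}{39}$$

Hence, $L = \sum p\{m_i\}\, n_i = 1.62$

$$H(X) = 2.388$$

$$\text{Efficiency} = \frac{2.388}{1.62 \times 1.584} = 0.93$$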
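
The figures of Example 4-2 can be recomputed with the Python sketch below. The code words are assumed from the table of this example, the decimal probabilities are used as printed, and the variable names are illustrative; because 0.167 and 0.083 are rounded values, the results may differ from the quoted figures in the third decimal place.

```python
import math
from collections import Counter

probs = [0.375, 0.167, 0.125, 0.125, 0.125, 0.083]
codes = ["0", "10", "11", "20", "21", "22"]
D = 3  # size of the coding alphabet [0, 1, 2]

avg_len = sum(p * len(c) for p, c in zip(probs, codes))
entropy = -sum(p * math.log2(p) for p in probs)
efficiency = entropy / (avg_len * math.log2(D))     # Eq. (4-2)

# Expected number of occurrences of each symbol per message, divided by L,
# gives the probability of that symbol in the encoded stream.
usage = Counter()
for p, c in zip(probs, codes):
    for symbol in c:
        usage[symbol] += p
symbol_probs = {s: usage[s] / avg_len for s in "012"}

print(avg_len, entropy, efficiency, symbol_probs)
```

The three symbol probabilities are unequal, which is why the efficiency falls short of 100 per cent: the encoded stream carries less than log 3 bits per ternary symbol.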

4-4. Necessary and Sufficient Conditions for Noiseless Coding. 

Given a discrete memoryless source, a noiseless channel, and an encoding 
alphabet with D symbols, what is the highest rate of transmission of 
information supplied by the source across the channel? Let the entropy 
of the source be $H(X)$ and the messages be encoded in words with an
average length L; then the entropy of the encoded information is supplied
at a rate

$$\frac{H(X)}{L}$$

Since the maximum of this rate is log D and it occurs when all D symbols
are equiprobable, it appears that the lowest value of L is

$$\frac{H(X)}{\log D}$$

However, it is not obvious that:

1. For a given set of message probabilities, it is possible to devise
codes with preassigned word lengths.

2. There exist uniquely decipherable codes leading to

$$L \ge \frac{H(X)}{\log D}$$

and no such codes exist with $L < H(X)/\log D$.

The object of this section is to discuss part 1, that is, to derive the necessary
and sufficient conditions for the existence of such noiseless encoding
procedures. Part 2 will be discussed in a later section.





Let {X} be the information source with N messages $[x_1, x_2, \ldots, x_N]$
and [D] the coding alphabet $[a_1, a_2, \ldots, a_D]$. Our problem is to find a
one-to-one correspondence between every element $x_k$ of {X} and a
sequence of a's having, say, $n_k$ digits, with the restriction that none of
the encoded messages can be obtained from each other by adding a
sequence of a's to the shorter encoded message. Thus, $[n_1, n_2, \ldots, n_N]$,
the lengths of the encoded words, cannot be arbitrary integers. They
must satisfy certain realizability conditions if the prefix requirement is
to be met. For instance, in the afore-mentioned partitioning procedure
(Sec. 4-2) we found that a code exists with

$$[n_1, n_2, \ldots, n_N] = [1, 2, \ldots, N]$$

The following theorem has been derived in Szilard, Kraft, Fano (I),
Mandelbrot (I), Sardinas and Patterson, McMillan (I), and Feinstein (I).
The proof given here is based on the latter two references.

A Noiseless Coding Theorem. The necessary and sufficient condition
for existence of an irreducible noiseless encoding procedure with specified
word lengths $[n_1, n_2, \ldots, n_N]$ is that a set of positive integers
$[n_1, n_2, \ldots, n_N]$ can be found such that

$$\sum_{i=1}^{N} D^{-n_i} \le 1 \qquad (4\text{-}6)$$

Proof. Clearly two encoded messages $x_i$ and $x_k$ can have the same
length, that is, $n_i = n_k$. Let $w_i$ be the number of encoded messages of
length i and note that the number of encoded messages with only one
letter cannot be larger than D.

$$w_1 \le D \qquad (4\text{-}7)$$

The number of encoded messages of length 2, because of our coding
restriction, cannot be larger than

$$w_2 \le (D - w_1)D = D^2 - w_1 D \qquad (4\text{-}8)$$

Similarly,

$$w_3 \le [(D - w_1)D - w_2]D = D^3 - w_1 D^2 - w_2 D \qquad (4\text{-}9)$$

Finally, if m is the maximum length of the encoded words, one concludes
that

$$w_m \le D^m - w_1 D^{m-1} - w_2 D^{m-2} - \cdots - w_{m-1} D \qquad (4\text{-}10)$$

Dividing both sides of this inequality by $D^m$ yields

$$0 \le 1 - w_1 D^{-1} - w_2 D^{-2} - \cdots - w_m D^{-m}$$

or

$$\sum_{i=1}^{m} w_i D^{-i} \le 1$$





It may not be obvious that this condition is identical with Eq. (4-6).
But note that

$$m \ge n_i \qquad i = 1, 2, \ldots, N$$

and $\sum_{i=1}^{m} w_i D^{-i}$ means the sum of “the numbers of all sequences of length
i multiplied by $D^{-i}$,” where the summation extends from 1 to m.

Let us examine what is implied by the above inequality. We can
rewrite it in the following way:

$$\sum_{i=1}^{m} w_i D^{-i} = \underbrace{[D^{-1}] + \cdots + [D^{-1}]}_{w_1 \text{ terms}} + \underbrace{[D^{-2}] + \cdots + [D^{-2}]}_{w_2 \text{ terms}} + \cdots + \underbrace{[D^{-m}] + \cdots + [D^{-m}]}_{w_m \text{ terms}} \qquad (4\text{-}11)$$

Each bracketed expression corresponds to a message $x_i$, and therefore the
total number of terms is N.

$$w_1 + w_2 + \cdots + w_m = N \qquad (4\text{-}12)$$

The terms in $w_k$ correspond to the encoded messages of length k. These
latter terms can be considered as $\sum D^{-n_i}$ when the summation takes place
over all those terms with $n_i = k$. Therefore, by a simple reassignment
of terms, we may equivalently write

$$\sum_{i=1}^{m} w_i D^{-i} = \sum_{i=1}^{N} D^{-n_i} \qquad (4\text{-}13)$$

Thus

$$\sum_{j=1}^{m} w_j D^{-j} = \sum_{i=1}^{N} D^{-n_i} \le 1$$

The desired set of positive integers $[n_1, n_2, \ldots, n_N]$ must satisfy the
inequality of Eq. (4-6). This proves the necessity requirement of the
theorem.

As an example, let

$$[X] = [x_1, x_2, x_3, x_4, x_5, x_6, x_7]$$

Assume that after encoding we get a set of messages with the following
lengths:

$$n_1 = 2 \quad n_2 = 2 \quad n_3 = 3 \quad n_4 = 3 \quad n_5 = 3 \quad n_6 = 4 \quad n_7 = 5$$

Therefore

$$w_1 = 0 \quad w_2 = 2 \quad w_3 = 3 \quad w_4 = 1 \quad w_5 = 1 \quad w_6 = 0 \quad w_7 = 0$$

The sets of desired integers $n_i$ and $w_i$ are thus

$$[n_i] = [2,2,3,3,3,4,5] \qquad [w_i] = [0,2,3,1,1,0,0]$$

$m = 5$ and

$$\sum_{i=1}^{m} w_i D^{-i} = \frac{2}{D^2} + \frac{3}{D^3} + \frac{1}{D^4} + \frac{1}{D^5}$$

$$\sum_{i=1}^{N} D^{-n_i} = \frac{1}{D^2} + \frac{1}{D^2} + \frac{1}{D^3} + \frac{1}{D^3} + \frac{1}{D^3} + \frac{1}{D^4} + \frac{1}{D^5}$$

The two sums are obviously equal.

Now we show that the condition

$$\sum_{j=1}^{m} w_j D^{-j} = w_1 D^{-1} + w_2 D^{-2} + \cdots + w_m D^{-m} \le 1$$

is sufficient for the existence of the desired codes. As all terms in Eq.
(4-13) are nonnegative, each term or the sum of a number of these terms must
be nonnegative and not greater than 1. Therefore we conclude that

$$w_1 D^{-1} \le 1 \quad \text{or} \quad w_1 \le D \qquad (4\text{-}14)$$

and

$$w_1 D^{-1} + w_2 D^{-2} \le 1 \quad \text{or} \quad w_2 \le D(D - w_1) \qquad (4\text{-}15)$$

and so on. But these are exactly the conditions that we have to satisfy
in order to guarantee that no encoded message can be obtained from any
other by the addition of a sequence of letters of the encoding alphabet.
As an application of the foregoing theorem, let [D] be a binary set, that is,
$[D] = [a_1, a_2]$; then the encoding theorem requires that

$$\sum_{i=1}^{N} 2^{-n_i} \le 1 \qquad (4\text{-}16)$$
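
The sufficiency argument is constructive, and one way to make it concrete is the Python sketch below. It is an illustration under stated assumptions, not the construction used in the text: words are assigned in order of nondecreasing length by counting in base D (a standard “canonical” assignment), the function names are hypothetical, and digits are written as the characters 0 to D-1, so the sketch assumes D is at most 10.

```python
from fractions import Fraction


def kraft_sum(lengths, D=2):
    """Exact value of the sum of D**(-n) over the requested word lengths."""
    return sum(Fraction(1, D ** n) for n in lengths)


def prefix_code_from_lengths(lengths, D=2):
    """Build one prefix code with the given word lengths, or return None."""
    if not lengths:
        return []
    if kraft_sum(lengths, D) > 1:
        return None  # Eq. (4-6) is violated, so no such code exists
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    words = [None] * len(lengths)
    value, prev_len = 0, lengths[order[0]]
    for i in order:
        n = lengths[i]
        value *= D ** (n - prev_len)     # extend the counter to n digits
        digits, v = [], value
        for _ in range(n):               # write value as n base-D digits
            digits.append(str(v % D))
            v //= D
        words[i] = "".join(reversed(digits))
        value += 1                       # next word starts past this one
        prev_len = n
    return words


print(prefix_code_from_lengths([2, 2, 3, 3, 3, 4, 5]))
print(prefix_code_from_lengths([1, 1, 2]))
```

The first call uses the word lengths of the seven-message example above and yields a binary prefix code; the second requests lengths whose sum in Eq. (4-16) exceeds 1, so no code is returned.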

As an application of the foregoing, consider the existence of a separable
code book having N words of equal length n. The noiseless coding
theorem suggests that such codes exist if

$$\sum_{k=1}^{N} D^{-n} = N D^{-n} \le 1 \qquad (4\text{-}17)$$

$$\log N \le n \log D$$

This latter relation between N, n, and D guarantees the existence of the
desired codes. In the particular case of D = 2, if we assume the further
constraint that the words of the code book could be ordered in such a
way that every two consecutive words differ by only one binit, the code
is referred to as the Gray code. For instance, for n = 2 and N = 4, we
have

00, 01, 11, 10

Gray codes are of some practical value in computing machines (for
example, in analog-to-digital conversion as described in the references in
the footnote).*
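
One common way to generate such an ordering is the reflected binary construction, sketched below in Python. The text does not say which construction it has in mind, so this choice and the function name `gray_code` are assumptions made for illustration.

```python
def gray_code(n):
    """List the 2**n binary words of length n in reflected Gray-code order."""
    if n < 1:
        raise ValueError("word length must be at least 1")
    # The i-th word is i XOR (i shifted right by one), written with n digits.
    return [format(i ^ (i >> 1), f"0{n}b") for i in range(2 ** n)]


print(gray_code(2))
print(gray_code(3))
```

For n = 2 the sketch returns the four words in the order given above, and for any n each word differs from its successor in exactly one binit.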

Example 4-3. Find the smallest number of letters in the alphabet (number D)
for devising a code with a prefix property such that

$$[w] = [0, 3, 0, 5]$$

Devise such a code.

Solution. The realizability condition is

$$3D^{-2} + 5D^{-4} \le 1$$

The inequality is satisfied for

$$D^2 \ge \frac{3 + \sqrt{29}}{2}$$

The smallest permissible value is D = 3. That is, no binary code can be devised
with the above constraint. To devise such a code, let the alphabet be [0,1,2]; then
one of several encoding procedures is

$m_1 = 00$
$m_2 = 01$
$m_3 = 02$
$m_4 = 1000$
$m_5 = 1001$
$m_6 = 1002$
$m_7 = 2000$
$m_8 = 2222$

Example 4-4. Show all possible sets of binary codes with the prefix property for
encoding the message ensemble

$$[m_1, m_2, m_3]$$

in words not more than three digits long.

Solution. All possible desired sets can be obtained from the inequality

$$w_1 2^{-1} + w_2 2^{-2} + w_3 2^{-3} \le 1$$

or

$$4w_1 + 2w_2 + w_3 \le 8$$

$$w_1 + w_2 + w_3 = 3 \qquad w_k \ge 0 \qquad k = 1, 2, 3$$


* S. H. Caldwell, “Switching Circuits and Logical Design,” John Wiley & Sons,
Inc., New York, 1958; M. Phister, Jr., “Logical Design of Digital Computers,” John
Wiley & Sons, Inc., New York, 1959.


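The possible sets of codes correspond to:

$w_1 = 1 \quad w_2 = 1 \quad w_3 = 1$
$w_1 = 1 \quad w_2 = 2 \quad w_3 = 0$
$w_1 = 1 \quad w_2 = 0 \quad w_3 = 2$
$w_1 = 0 \quad w_2 = 3 \quad w_3 = 0$
$w_1 = 0 \quad w_2 = 2 \quad w_3 = 1$
$w_1 = 0 \quad w_2 = 1 \quad w_3 = 2$
$w_1 = 0 \quad w_2 = 0 \quad w_3 = 3$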

The corresponding codes can be found without difficulty. 
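
The admissible sets of word counts in Example 4-4 can also be found by brute force. The Python sketch below simply tries every triple of counts between 0 and 3 and keeps those meeting both conditions; the variable names are illustrative.

```python
from itertools import product

# (w1, w2, w3): numbers of code words of length 1, 2, and 3 for three messages.
solutions = [
    (w1, w2, w3)
    for w1, w2, w3 in product(range(4), repeat=3)
    if w1 + w2 + w3 == 3 and 4 * w1 + 2 * w2 + w3 <= 8
]

for s in solutions:
    print(s)
```

The enumeration yields seven triples. Triples with two or more one-digit words are excluded, since they would leave no room for a third word.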

4-5. A Theorem on Decodability. In Sec. 4-4, we have given the
necessary and sufficient conditions for the existence of a set of irreducible
code words of specified length. Actually a stronger theorem concerning
the unique decipherability (not necessarily irreducible code words)
has been derived by McMillan (I).*

McMillan’s Theorem. Let $[m_1, m_2, \ldots, m_N]$ be a sequence of
messages encoded in uniquely decipherable words of respective symbol
length $[n_1, n_2, \ldots, n_N]$ taken from a finite alphabet $[a_1, a_2, \ldots, a_D]$.
Then

$$\sum_{i=1}^{N} D^{-n_i} \le 1 \qquad (4\text{-}18)$$

Fig. 4-5. The disk of unique decipherability in the complex plane.

Proof. Let l be the largest element of $[n_1, n_2, \ldots, n_N]$, and $w_i$, as
before, the number of distinct words of length i. Then it is desired to
prove that

$$\sum_{i=1}^{l} w_i D^{-i} \le 1 \qquad (4\text{-}18a)$$

McMillan employed an interesting method for proving Eq. (4-18a). Let

$$Q(z) = \sum_{i=1}^{l} w_i z^i \qquad (4\text{-}19)$$

We prove that

$$Q(z) < 1 \qquad \text{for} \qquad 0 < z < D^{-1}$$

Let N(k) be the number of distinct sequences of words of total length k taken from the
alphabet [a]. The decodability condition requires that

$$N(k) \le D^k \qquad (4\text{-}20)$$

* Also Mandelbrot (I) and L. Kraft in an unpublished 1949 MIT thesis.




Now consider the infinite series

$$F(z) = 1 + N(1)z + N(2)z^2 + \cdots \qquad (4\text{-}21)$$

This series converges within the disk $|z| < D^{-1}$. (Why?)

Next we look into the property of unique decipherability which permits
writing for a sequence of length k

$$N(k) = w_1 N(k-1) + w_2 N(k-2) + \cdots + w_l N(k-l) \qquad (4\text{-}22)$$

It is easy to see that this recurrent formula holds if we let

$$N(0) = 1 \qquad \text{and} \qquad N(h) = 0 \quad \text{for } h < 0$$

Note that

$$F(z) - 1 = \sum_{k=1}^{\infty} N(k) z^k = \sum_{k=1}^{\infty} \sum_{i=1}^{l} w_i N(k-i) z^k = Q(z) F(z) \qquad \text{that is,} \qquad F(z) = \frac{1}{1 - Q(z)} \qquad (4\text{-}23)$$

Since $F(z)$ is an analytic rational function in $|z| < D^{-1}$ and the denominator
is a continuous function, with $1 - Q(0) = 1$, it follows that $1 - Q(z)$ has
no zeros in the disk $|z| < D^{-1}$ and $Q(z) < 1$ for $0 < z < D^{-1}$. Letting z
approach $D^{-1}$ then gives $Q(D^{-1}) \le 1$, which is Eq. (4-18a).

4-6. Average Length of Encoded Messages. The theorem of Sec.
4-4 can be successfully employed for obtaining a lower bound for the
average length of encoded messages. In Sec. 4-3, we pointed out that
the average length L of the encoded messages, that is

$$L = \sum_{i} P\{x_i\}\, n_i \qquad (4\text{-}24)$$

gives a measure of the efficiency of coding. In this section it will
be shown that L cannot be decreased beyond a certain limit. More
specifically:

Theorem. Let {X} be a discrete message source, without memory,
and $x_i$ any message of this source with probability of transmission $P\{x_i\}$.
If the {X} ensemble is encoded in a sequence of uniquely decipherable
characters taken from the alphabet $[a_1, a_2, \ldots, a_D]$, then

$$L \ge \frac{H(X)}{\log D} \qquad (4\text{-}25)$$



Proof. The proof of this theorem lies in the following lemma:

Lemma. Consider two sets of nonnegative numbers $\{p_i\}$ and $\{q_i\}$,
such that

$$\sum_{i=1}^{n} p_i = \sum_{i=1}^{n} q_i = 1$$

Then*

$$-\sum_{i=1}^{n} q_i \log q_i \le -\sum_{i=1}^{n} q_i \log p_i \qquad (4\text{-}26)$$

Now we shall apply this lemma to the sets of nonnegative numbers

$$q_i = P\{x_i\} \qquad p_i = \frac{D^{-n_i}}{\sum_{k=1}^{N} D^{-n_k}} \qquad (4\text{-}27)$$

where D is a positive integer and i ranges over the sequence [1, 2, ..., N].
Indeed in Eq. (4-27) the denominator is a nonnegative number less than
or equal to 1 and

$$\sum_{i=1}^{N} p_i = \frac{\sum_{i=1}^{N} D^{-n_i}}{\sum_{k=1}^{N} D^{-n_k}} = 1$$

Therefore we may write

$$H(X) = -\sum_{i=1}^{N} P\{x_i\} \log P\{x_i\} \le -\sum_{i=1}^{N} P\{x_i\} \log \frac{D^{-n_i}}{\sum_{k=1}^{N} D^{-n_k}} \qquad (4\text{-}28)$$

or

$$H(X) \le \log \sum_{k=1}^{N} D^{-n_k} + \sum_{i=1}^{N} P\{x_i\}\, n_i \log D$$

Applying the theorem of Sec. 4-4 yields

$$H(X) \le \log 1 + L \log D$$

or

$$L \ge \frac{H(X)}{\log D} \qquad (4\text{-}29)$$

* The proof is left for the reader. (Use convexity of x log x. See also Prob. 3-10.)





This result is rather interesting as it clearly shows that even in the absence
of noise no uniquely decipherable encoding procedure can be devised
such that its average message length is less than a fixed number that is
the ratio of the source entropy and log D. (Log D is the maximum
possible entropy associated with the selected alphabet containing D
characters or the capacity of the coding alphabet subject to the above
constraint.) This lower bound is not generally achieved unless the
$n_i$ are all appropriately chosen integers. In such a case we obtain an
optimum encoding procedure, that is,

$$L_0 = \frac{H(X)}{\log D} = \frac{\text{source entropy}}{\text{capacity of coding alphabet}} \qquad (4\text{-}30)$$

While $L_0$, the lower bound of L, may not always be reached by an encoding
procedure, it is always possible to give an obtainable bound for L.
In fact, if we let

$$-\frac{\log P\{x_k\}}{\log D} \le n_k < -\frac{\log P\{x_k\}}{\log D} + 1 \qquad k = 1, 2, \ldots, N \qquad D \ge 2$$

we rest assured that

$$\frac{H(X)}{\log D} \le \sum_{k=1}^{N} P\{x_k\}\, n_k < 1 + \frac{H(X)}{\log D} \qquad (4\text{-}31)$$

Note that owing to the above inequality and its consequence

$$P\{x_k\} \ge D^{-n_k}$$

the noiseless coding theorem is satisfied. Thus uniquely decipherable
codes exist. Shannon has developed a binary encoding scheme based on
Eq. (4-31) which will be discussed in the next section.
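
The choice of word lengths leading to Eq. (4-31) can be tried numerically with the Python sketch below. The function name `shannon_lengths` is hypothetical, the probabilities are borrowed from Example 4-1 purely as test data, and the lengths are found by a simple search for the smallest n with $D^n P\{x_k\} \ge 1$ rather than by evaluating logarithms, which sidesteps rounding trouble when a probability is an exact power of 1/D.

```python
import math


def shannon_lengths(probs, D=2):
    """Smallest positive integers n_k satisfying D**n_k * P{x_k} >= 1."""
    lengths = []
    for p in probs:
        n = 1
        while D ** n * p < 1:
            n += 1
        lengths.append(n)
    return lengths


probs = [0.49, 0.14, 0.14, 0.07, 0.07, 0.04, 0.02, 0.02, 0.01]
lengths = shannon_lengths(probs)
kraft = sum(2.0 ** -n for n in lengths)
entropy = -sum(p * math.log2(p) for p in probs)
avg_len = sum(p * n for p, n in zip(probs, lengths))
print(lengths, kraft, entropy, avg_len)
```

For this ensemble the sum in Eq. (4-6) stays below 1 and the average length lies between H(X) and H(X) + 1, as Eq. (4-31) requires for D = 2. The average length is larger than the 2.33 digits obtained by the partitioning of Example 4-1, which shows that the bound is attainable but not necessarily tight.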

The bound given by the theorem of this section is based on the assump- 
tion of decipherability of the encoded messages. If this strong require- 
ment is removed, the average length may be reduced. As an example, 
consider the case 

$x_1 \to 1$
$x_2 \to 0$
$x_3 \to 100$


You can show as an exercise that, with a proper selection of probabilities, 
one may violate the lower bound of Eq. (4-31). 

Example 4-5. The output of a discrete source,

$$[X] = [x_1, x_2, x_3, x_4, x_5, x_6]$$

$$P\{X\} = [\tfrac12, \tfrac14, \tfrac1{16}, \tfrac1{16}, \tfrac1{16}, \tfrac1{16}]$$





is encoded in the following six ways: 



         $C_1$     $C_2$    $C_3$      $C_4$    $C_5$     $C_6$

$x_1$    0         1        0          111      1         0
$x_2$    10        011      10         110      01        01
$x_3$    110       010      110        101      0011      011
$x_4$    1110      001      1110       100      0010      0111
$x_5$    1011      000      11110      011      0001      01111
$x_6$    1101      110      111110     010      0000      011111


(o) Determine which of these codes are uniquely decipherable. 

(6) Determine those which have the prefix property. 

(c) Find the average length of each uniquely deciphernbie cf»de. 

(d) Does any one of the above codes give minimum average length? 

Solution

(a) By direct inspection one finds that C3, C4, C5, and C6 are uniquely decipherable.
While it is clear that C1 and C2 are not uniquely decipherable, we may wish to apply
McMillan’s realizability criterion.

For C1: $2^{-1} + 2^{-2} + 2^{-3} + 3 \cdot 2^{-4} > 1$

For C2: $2^{-1} + 5 \cdot 2^{-3} > 1$

Thus uniquely decipherable codes with these word lengths cannot exist.

(b) C3, C4, and C5 have the prefix property, but C6 does not have such a property.

(c) $L_3 = \tfrac{1}{2} \cdot 1 + \tfrac{1}{4} \cdot 2 + \tfrac{1}{16}(3 + 4 + 5 + 6) = 2\tfrac{1}{8}$

$L_4 = 3 \qquad L_5 = 2 \qquad L_6 = 2\tfrac{1}{8}$

(d) The entropy of the source is

$$ H(X) = -\tfrac{1}{2} \log \tfrac{1}{2} - \tfrac{1}{4} \log \tfrac{1}{4} - \tfrac{4}{16} \log \tfrac{1}{16} = 2 \text{ bits per symbol} $$

According to Eq. (4-29), the average length of a uniquely decipherable code cannot be
less than H(X). Thus C5 is a code that achieves minimum average length.
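
The answers to the example above can be checked mechanically. The following is a minimal sketch, not part of the original text: the function names are illustrative, the probabilities are those implied by the entropy computation in part (d), and unique decipherability is tested with the Sardinas-Patterson procedure, a standard test that the text itself does not use.

```python
from fractions import Fraction

# Code words of the example above (rows x1..x6) and the source probabilities.
CODES = {
    "C1": ["0", "10", "110", "1110", "1011", "1101"],
    "C2": ["1", "011", "010", "001", "000", "110"],
    "C3": ["0", "10", "110", "1110", "11110", "111110"],
    "C4": ["111", "110", "101", "100", "011", "010"],
    "C5": ["1", "01", "0011", "0010", "0001", "0000"],
    "C6": ["0", "01", "011", "0111", "01111", "011111"],
}
PROBS = [Fraction(1, 2), Fraction(1, 4)] + [Fraction(1, 16)] * 4


def kraft_sum(words, D=2):
    """Sum of D**(-n_k); a value above 1 rules out unique decipherability."""
    return sum(Fraction(1, D ** len(w)) for w in words)


def has_prefix_property(words):
    """True if no code word is the beginning of another code word."""
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words))
        for j in range(len(words))
    )


def _dangling(heads, tails):
    """Nonempty suffixes left when a word of `heads` begins a word of `tails`."""
    return {t[len(h):] for h in heads for t in tails if t != h and t.startswith(h)}


def uniquely_decipherable(words):
    """Sardinas-Patterson test (an assumption of this sketch, not the book's method)."""
    code = set(words)
    if len(code) < len(words):
        return False
    current, seen = _dangling(code, code), set()
    while current:
        if current & code:          # a dangling suffix is itself a code word
            return False
        key = frozenset(current)
        if key in seen:             # suffix sets repeat: no ambiguity can arise
            return True
        seen.add(key)
        current = _dangling(code, current) | _dangling(current, code)
    return True


def average_length(words, probs):
    return sum(p * len(w) for p, w in zip(probs, words))


if __name__ == "__main__":
    for name, words in CODES.items():
        print(name, kraft_sum(words), uniquely_decipherable(words),
              has_prefix_property(words), average_length(words, PROBS))
```

Exact fractions are used so that the average lengths can be compared with the entropy of 2 bits without rounding.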

4-7. Shannon’s Binary Encoding. Shannon has suggested a binary
encoding procedure based on Eq. (4-31). First we must reassure
ourselves that such codes exist. For this, we employ the noiseless
coding theorem. Equation (4-31) yields

$$ 2^{-n_k} \le P\{x_k\} \qquad k = 1, 2, \ldots, N \tag{4-32} $$

If these inequalities are all satisfied, then

$$ \sum_{k=1}^{N} 2^{-n_k} \le 1 $$

Thus we are sure that the desired code exists. Note that such codes will
have the interesting property that their average length is constrained by

$$ H(X) \le L < H(X) + 1 \tag{4-33} $$





The following steps describe this method:

1. Write down the message ensemble in the order of nonincreasing
probabilities, say,

$$ [X] = [x_1, x_2, \ldots, x_N] $$
$$ P\{x_1\} \ge P\{x_2\} \ge \cdots \ge P\{x_N\} \tag{4-34} $$

2. Compute the sequences

$$ \alpha_1 = 0 $$
$$ \alpha_2 = P\{x_1\} $$
$$ \alpha_3 = P\{x_2\} + P\{x_1\} = P\{x_2\} + \alpha_2 \tag{4-35} $$
$$ \alpha_4 = P\{x_3\} + \alpha_3 $$

and so on.

3. Determine the set of integers $n_k$ which is the smallest integer solution
of the inequalities

$$ 2^{n_k} P\{x_k\} \ge 1 \qquad k = 1, 2, \ldots, N \tag{4-36} $$

4. Expand the decimal numbers $\alpha_k$ in binary form to $n_k$ places; that is,
neglect the expansion beyond the $n_k$ digits.*

Proof. To show the validity of the method, first we note that owing
to the decodability theorem an encoding procedure with the prefix
property must exist. In the second place, one observes that the numbers
$\alpha_k$ correspond to ordinates of the CDF as shown in Fig. 4-6:

$$ 0 = \alpha_1 < \alpha_2 < \alpha_3 < \cdots < \alpha_N < \alpha_{N+1} = 1 \tag{4-37} $$

Now assume that $\alpha_k$ has been expanded to $n_k$ places and
$\alpha_{k+1}$ to $n_{k+1}$ places, where $n_{k+1} \ge n_k$.

But

$$ \alpha_{k+1} = \alpha_k + P\{x_k\} \tag{4-38} $$

This equality, written in the binary form, will become

$$ \cdots = .\tau_{-1}\tau_{-2}\cdots \tag{4-39} $$

Since $P\{x_k\} \ge 2^{-n_k}$ by Eq. (4-32), $\alpha_{k+1}$ exceeds $\alpha_k$ by at
least $2^{-n_k}$, so the two expansions must differ somewhere in their first
$n_k$ places. Keeping Eq. (4-39) in mind, one can see that the suggested codes for
$\alpha_k$ and $\alpha_{k+1}$ will be distinct binary numbers.

* The solutions for the $\alpha$ of Eqs. (4-35) will be numbers expressed in decimal form.
To express a decimal number N in binary form one must determine the set of $\tau_k$ such
that

$$ N = \cdots + \tau_3 (2)^3 + \tau_2 (2)^2 + \tau_1 (2)^1 + \tau_0 (2)^0 + \tau_{-1} (2)^{-1} + \tau_{-2} (2)^{-2} + \cdots $$

where $\tau_k$ ($k = 0, \pm 1, \pm 2, \ldots$) is either 0 or 1. The binary form is usually written
in the following abbreviated manner, $\ldots \tau_3 \tau_2 \tau_1 \tau_0 . \tau_{-1} \tau_{-2} \tau_{-3} \ldots$, where the “point”
is called a binary point. Any single $\tau_k$ is called a binary digit. It has been suggested
[Golay†] that binary digit be abbreviated to binit. This convention would help to
destroy the erroneous connection between a “bit” (a unit of binary information) and
a “binary digit” (a term in a binary number). (Note the parallelism between the
binary and decimal forms. By letting $\tau_k = 0, 1, 2, \ldots, 9$ and replacing the 2’s in
the parentheses with 10’s we have a number expressed in decimal form. The “point”
is then called the decimal point.) Consider the following illustrative examples.

Decimal form      Binary form
8.00              1000.00
7.00              0111.00
5.50              0101.10
0.25              0000.01
0.40              0000.0110011

† M. J. E. Golay, Proc. IRE, vol. 42, no. 9, p. 1452, September, 1954.



Fig. 4-6. A CDF associated with Shannon’s binary encoding.


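Example 4-6. Apply Shannon’s encoding procedure to the following message
ensemble:

$$ [X] = [x_1, x_2, x_3, x_4] $$
$$ [P] = [0.4,\ 0.3,\ 0.2,\ 0.1] $$

Solution

Step 1.  $0.4 > 0.3 > 0.2 > 0.1$

Step 2.  $\alpha_1 = 0 \qquad \alpha_2 = 0.4 \qquad \alpha_3 = 0.7 \qquad \alpha_4 = 0.9$

Step 3.  $0.4 \ge 2^{-2} \qquad n_1 = 2$
         $0.3 \ge 2^{-2} \qquad n_2 = 2$
         $0.2 \ge 2^{-3} \qquad n_3 = 3$
         $0.1 \ge 2^{-4} \qquad n_4 = 4$

Step 4.  (the bar marks where the binary expansion is cut off)

         alpha          binary        message   code
         α1 = 0         .00|          x1        00
         α2 = 0.4       .01|1         x2        01
         α3 = 0.7       .101|1        x3        101
         α4 = 0.9       .1110|        x4        1110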
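
The four steps of Shannon's procedure can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the author's own program: the function name `shannon_code` is hypothetical, probabilities are passed as decimal strings so that exact fractions avoid rounding in the binary expansion, and the input is assumed to be already sorted as step 1 requires.

```python
from fractions import Fraction


def shannon_code(probs):
    """Shannon binary code words for probabilities in nonincreasing order."""
    probs = [Fraction(p) for p in probs]
    # Step 1: the ensemble must be ordered by nonincreasing probability.
    assert all(a >= b for a, b in zip(probs, probs[1:])), "sort the probabilities first"
    words = []
    alpha = Fraction(0)                 # Step 2: alpha_1 = 0
    for p in probs:
        n = 0                           # Step 3: smallest n with 2**n * p >= 1
        while 2 ** n * p < 1:
            n += 1
        frac, digits = alpha, []        # Step 4: first n binary places of alpha
        for _ in range(n):
            frac *= 2
            bit = int(frac)             # integer part is the next binary digit
            digits.append(str(bit))
            frac -= bit
        words.append("".join(digits))
        alpha += p                      # Step 2: alpha_{k+1} = alpha_k + P{x_k}
    return words


codes = shannon_code(["0.4", "0.3", "0.2", "0.1"])
print(codes)  # ['00', '01', '101', '1110']
```

The printed words agree with the code obtained in Example 4-6.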





4-8. Fundamental Theorem of Discrete Noiseless Coding. There is
an important theorem due to C. E. Shannon which states:

Theorem. Let S be a discrete source without memory with a
communication entropy H(X) and a noiseless channel with capacity C bits
per message. It is possible to encode the output of S so that, if the
encoded messages are transmitted through the channel, the rate of
transmission of information approaches C per symbol as closely as desired.

Proof. We have already seen that for a given source $[X_1]$ with N
messages the length of each encoded message may be constrained by the
following inequalities:

$$ -\frac{\log P\{x_k\}}{\log D} \le n_k < -\frac{\log P\{x_k\}}{\log D} + 1 \qquad k = 1, 2, \ldots, N \tag{4-40} $$

Furthermore it can be seen that the average length $L_1$ satisfies the relation

$$ \frac{H(X_1)}{\log D} \le L_1 < \frac{H(X_1)}{\log D} + 1 \tag{4-41} $$


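Now suppose that we consider the source $X_2 = X_1 \otimes X_1$, that is, a source
which transmits independently the following messages:

$$ [X_2] = \begin{bmatrix} x_1 x_1 & x_1 x_2 & \cdots & x_1 x_N \\ x_2 x_1 & x_2 x_2 & \cdots & x_2 x_N \\ \cdots & \cdots & \cdots & \cdots \\ x_N x_1 & x_N x_2 & \cdots & x_N x_N \end{bmatrix} \tag{4-42} $$

If it is assumed that the successive messages are independent, the
corresponding probability matrix is

$$ [P_2] = \begin{bmatrix} P\{x_1\}P\{x_1\} & P\{x_1\}P\{x_2\} & \cdots & P\{x_1\}P\{x_N\} \\ P\{x_2\}P\{x_1\} & P\{x_2\}P\{x_2\} & \cdots & P\{x_2\}P\{x_N\} \\ \cdots & \cdots & \cdots & \cdots \\ P\{x_N\}P\{x_1\} & P\{x_N\}P\{x_2\} & \cdots & P\{x_N\}P\{x_N\} \end{bmatrix} \tag{4-43} $$

This source will be referred to as a second-order extension of the original
source. Now if this message ensemble is encoded, we expect that the
average length $L_2$ of the messages of the new source will satisfy the
relation

$$ \frac{H(X_2)}{\log D} \le L_2 < \frac{H(X_2)}{\log D} + 1 \tag{4-44} $$

As the successive messages are independent, one finds

$$ H(X_2) = H(X_1) + H(X_1) = 2H(X_1) \tag{4-45} $$

Thus

$$ \frac{2H(X_1)}{\log D} \le L_2 < \frac{2H(X_1)}{\log D} + 1 \tag{4-46} $$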



In a similar fashion we consider the Mth-order extension of the original
source. If the message ensemble of $X_M = X_1 \otimes X_1 \otimes \cdots \otimes X_1$ (M times)
is similarly encoded, we conclude that

$$ \frac{H(X_M)}{\log D} \le L_M < \frac{H(X_M)}{\log D} + 1 \tag{4-47} $$

or

$$ \frac{M\,H(X_1)}{\log D} \le L_M < \frac{M\,H(X_1)}{\log D} + 1 $$

Finally,

$$ \frac{H(X_1)}{\log D} \le \frac{L_M}{M} < \frac{H(X_1)}{\log D} + \frac{1}{M} \tag{4-48} $$

When M is made infinitely large, we obtain

$$ \lim_{M \to \infty} \frac{L_M}{M} = \frac{H(X_1)}{\log D} \tag{4-49} $$


This completes the proof of the so-called first fundamental coding theorem.
It should be kept in mind that, while we asymptotically approach the
above limit, the procedure does not necessarily yield a monotonically
increasing improvement. That is, it is possible to have a situation where

$$ \frac{1}{M} L_M \ge \frac{1}{M-1} L_{M-1} $$
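
The squeeze on $L_M/M$ can be observed numerically. The sketch below is only an illustration under stated assumptions: it encodes the Mth-order extension with binary word lengths chosen as in Eq. (4-40) with D = 2 (Shannon lengths, not an optimum code), uses a hypothetical four-message source, and its function names are not from the text.

```python
from fractions import Fraction
from itertools import product
from math import log2


def entropy(probs):
    """Entropy in bits of a list of probabilities."""
    return -sum(float(p) * log2(float(p)) for p in probs)


def per_symbol_length(probs, M):
    """L_M / M for the Mth-order extension, with lengths n = ceil(-log2 p)."""
    probs = [Fraction(p) for p in probs]
    total = Fraction(0)
    for block in product(probs, repeat=M):   # every message of the extension
        p_block = Fraction(1)
        for p in block:                      # independence: probabilities multiply
            p_block *= p
        n = 0
        while Fraction(1, 2 ** n) > p_block:  # smallest n with 2**-n <= p_block
            n += 1
        total += p_block * n
    return total / M


if __name__ == "__main__":
    source = ["0.4", "0.3", "0.2", "0.1"]     # hypothetical source
    H = entropy(source)
    for M in range(1, 5):
        # Each line shows H <= L_M/M < H + 1/M, as in Eq. (4-48) with D = 2.
        print(M, H, float(per_symbol_length(source, M)), H + 1 / M)
```

The number of extension messages grows as $N^M$, so the sketch is practical only for small M.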


4-9. Huffman’s Minimum-redundancy Code. Huffman has suggested
a simple method for constructing separable codes with minimum
redundancy for a set of discrete messages [Huffman (I)]. The meaning of the
latter term will be described shortly. Let [X] be the message ensemble,
[P] the corresponding probability matrix, [D] the encoding alphabet,
and $L(x_k)$ the length of the encoded message $x_k$. Then

$$ L = E[L(x_k)] = \sum_{k=1}^{N} P\{x_k\} L(x_k) \tag{4-50} $$

A minimum redundancy or an optimum code is one that leads to the
lowest possible value of L for a given D. This definition is accepted,
having in mind the irreducibility requirements. That is, distinct
messages must be encoded in uniquely decipherable words with the prefix
property. To comply with these requirements, Huffman derives the
following results:

1. For an optimum encoding, the longer code word should correspond
to a message with lower probability; thus if for convenience the messages
are numbered in order of nonincreasing probability,

$$ P\{x_1\} \ge P\{x_2\} \ge P\{x_3\} \ge \cdots \ge P\{x_N\} \tag{4-51} $$

then

$$ L(x_1) \le L(x_2) \le L(x_3) \le \cdots \le L(x_N) \tag{4-52} $$



Indeed, if Eq. (4-52) is not met for two messages $x_k$ and $x_j$, one may
interchange their corresponding codes and arrive at a lower value of L.
Thus such codes cannot be of the optimum type.

2. For an optimum code it is necessary that

$$ L(x_{N-1}) = L(x_N) \tag{4-53} $$

If we assign similar code words to $x_N$ and $x_{N-1}$ except for the final digit,
our purpose is served. Any additional digit for $x_N$ and $x_{N-1}$ unnecessarily
increases L. Therefore, at least two messages $x_{N-1}$ and $x_N$ should be
encoded in words of identical length. However, not more than D such
messages could have equal length. It can be shown that, for an optimum
encoding, $n_0$, the number of least probable messages which should be
encoded in words of equal length, is the integer satisfying the requirements

$$ \frac{N - n_0}{D - 1} = \text{integer} \qquad 2 \le n_0 \le D $$

3. Each sequence of length $L(x_N) - 1$ digits either must be used as an
encoded word or must have one of its prefixes used as an encoded word.

In the following we shall restrict ourselves to the binary case (D = 2).
A similar procedure applies for the general case as shown in Example 4-8.
Condition 2 now requires that the two least probable messages have the
same length. Condition 2 specifies that the two encoded messages of
this length are identical except for their last digits. We shall select these
two messages to be the Nth and (N − 1)st original messages. After
such a selection we form a composite message out of these two messages
with a probability equal to the sum of their probabilities. The set of
messages X in which the composite message is replacing the aforementioned
two messages will be referred to as an auxiliary ensemble of order
1 or simply AE1. Now we shall apply the rules for finding optimum
codes to AE1; this will lead to AE2, AE3, and so on. The code words
for each two least probable members of any ensemble AEK are identical
except for their last digits, which are 0 for one and 1 for the other. The
iteration cycle is continued up to the time that AEM has only two
messages. A final digit 0 is assigned to one of the messages and 1 to the
other. Now we shall trace back our path and remember each two
messages which have to differ only in their last digits. The optimality
of the procedure is a direct consequence of the previously described
optimal steps. (For additional material see Fano [1].)

Huffman’s method provides an optimum encoding in the described
sense. The methods suggested earlier by Shannon and Fano do not
necessarily lead to an optimum code.
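
The binary procedure just described (repeatedly merging the two least probable members into a composite message, then tracing back) might be sketched as follows. The function name `huffman_code` and the use of a heap are choices of this sketch, not of the text; which of the two merged members receives 0 and which receives 1 is arbitrary.

```python
import heapq
from fractions import Fraction
from itertools import count


def huffman_code(probs):
    """Binary Huffman code words, in the order the probabilities are given."""
    tie = count()   # tie-breaker so the heap never has to compare the lists
    heap = [(Fraction(p), next(tie), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    words = [""] * len(probs)
    while len(heap) > 1:
        # The two least probable members of the current auxiliary ensemble.
        p0, _, group0 = heapq.heappop(heap)
        p1, _, group1 = heapq.heappop(heap)
        # Their words differ only in this digit; digits assigned by later
        # merges are prepended, which plays the role of tracing back the path.
        for i in group0:
            words[i] = "0" + words[i]
        for i in group1:
            words[i] = "1" + words[i]
        # The composite message carries the sum of the two probabilities.
        heapq.heappush(heap, (p0 + p1, next(tie), group0 + group1))
    return words


def average_length(probs, words):
    return sum(Fraction(p) * len(w) for p, w in zip(probs, words))


if __name__ == "__main__":
    probs = ["0.4", "0.3", "0.2", "0.1"]      # hypothetical ensemble
    words = huffman_code(probs)
    print(words, average_length(probs, words))
```

Different tie-breaking rules can produce different code words, but every code built this way has the same minimum average length.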





Example 4-7. Given the following set of messages and their corresponding
transmission probabilities

$$ [m_1, m_2, m_3] $$
$$ [1/3,\ 1/3,\ 1/3] $$

(a) Construct a binary code satisfying the prefix condition and having the minimum
possible average length of encoded digits. Compute the efficiency of the code.

(b) Next consider a source transmitting messages

$m_1 m_1 \qquad m_1 m_2 \qquad m_1 m_3$
$m_2 m_1 \qquad m_2 m_2 \qquad m_2 m_3$
$m_3 m_1 \qquad m_3 m_2 \qquad m_3 m_3$

Construct a binary code with the prefix property and minimum average length and
compute its efficiency.

Solution

(a) If the binary code must have the prefix property, then we assign the following
code:

$m_1 \to 0$
$m_2 \to 10$
$m_3 \to 11$

The average length of the code word is

$$ \tfrac{1}{3} \cdot 1 + \tfrac{1}{3} \cdot 2 + \tfrac{1}{3} \cdot 2 = \tfrac{5}{3} \text{ binits} $$

0 and 1 appear with probabilities of 2/5 and 3/5, respectively.

$$ \text{Efficiency} = \frac{\log 3}{\tfrac{5}{3} \log 2} = 0.95 $$

Shannon’s encoding also leads to the same result.
(b) We construct a Huffman code:

[Huffman reduction diagram: the nine messages, each of probability 1/9, are
merged pairwise into composite messages until only two remain.]

The resulting code words are

110   101   100   011   010   001   000   1111   1110

and the digit 0 appears with probability P{0} = 13/29.

As all messages are equiprobable,

$$ \text{Average length} = \frac{7 \times 3 + 2 \times 4}{9} = \frac{29}{9} = 3.22 \text{ binits} $$

$$ \text{Efficiency} = \frac{\log 9}{3.22 \log 2} = 0.98 $$


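Example 4-8. Apply Huffman’s encoding procedure to the following message
ensemble and determine the average length of the encoded message.

$$ [X] = [x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_{10}] $$
$$ [P] = [0.18,\ 0.17,\ 0.16,\ 0.15,\ 0.10,\ 0.08,\ 0.05,\ 0.05,\ 0.04,\ 0.02] $$

The encoding alphabet is [D] = [0, 1, 2, 3].

Solution

[Huffman reduction diagram: the four least probable messages $x_7, x_8, x_9, x_{10}$
are first combined into a composite message of probability 0.16. The four
least probable members of the resulting ensemble (0.16, 0.15, 0.10, 0.08) are then
combined into a composite of probability 0.49. The final ensemble
(0.49, 0.18, 0.17, 0.16) receives the digits 0, 1, 2, 3.]

Message   Probability   Code
x1        0.18          1
x2        0.17          2
x3        0.16          00
x4        0.15          01
x5        0.10          02
x6        0.08          03
x7        0.05          30
x8        0.05          31
x9        0.04          32
x10       0.02          33

L = 0.18 × 1 + 0.17 × 1 + 0.16 × 2 + 0.15 × 2 + 0.10 × 2 + 0.08 × 2
    + 0.05 × 2 + 0.05 × 2 + 0.04 × 2 + 0.02 × 2
  = 1.65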
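
For a general alphabet of D letters, the only change from the binary procedure is the size of the first merge, $n_0$. The sketch below computes word lengths only; the function name is hypothetical, and the closed form $n_0 = 2 + (N - 2) \bmod (D - 1)$ is this sketch's way of solving the two requirements on $n_0$ stated in result 2, not a formula quoted from the text.

```python
import heapq
from fractions import Fraction
from itertools import count


def dary_huffman_lengths(probs, D):
    """Word lengths of a D-ary Huffman code, in the order of `probs`."""
    N = len(probs)
    tie = count()   # tie-breaker so the heap never compares the member lists
    heap = [(Fraction(p), next(tie), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * N
    # First merge takes n_0 messages: (N - n_0)/(D - 1) integer, 2 <= n_0 <= D.
    group_size = 2 + (N - 2) % (D - 1)
    while len(heap) > 1:
        total, members = Fraction(0), []
        for _ in range(group_size):
            p, _, group = heapq.heappop(heap)
            total += p
            members += group
        for i in members:            # every merged message gains one digit
            lengths[i] += 1
        heapq.heappush(heap, (total, next(tie), members))
        group_size = D               # all later merges take D members
    return lengths


probs = ["0.18", "0.17", "0.16", "0.15", "0.10",
         "0.08", "0.05", "0.05", "0.04", "0.02"]
lengths = dary_huffman_lengths(probs, D=4)
average = sum(Fraction(p) * n for p, n in zip(probs, lengths))
print(lengths, float(average))
```

For the ensemble of Example 4-8 this yields two words of length 1 and eight of length 2, and hence the average length 1.65 found above.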

4-10. Gilbert-Moore Encoding. The Gilbert-Moore alphabetical
encoding is an interesting and simple procedure. While Shannon’s
encoding discussed in Sec. 4-7 was based on Eq. (4-31), the Gilbert-Moore
method is based on the inequality

$$ 2^{1-n_k} \le P\{x_k\} < 2^{2-n_k} \qquad k = 1, 2, \ldots, N \tag{4-54} $$

To see that such a code exists, we note that

$$ \sum_{k=1}^{N} 2^{1-n_k} \le 1 < \sum_{k=1}^{N} 2^{2-n_k} $$

$$ 2 \sum_{k=1}^{N} 2^{-n_k} \le 1 < 4 \sum_{k=1}^{N} 2^{-n_k} \tag{4-55} $$

Thus the existence of the desired code is guaranteed. Furthermore, such
codes will have the property of

$$ 1 - n_k \le \log P\{x_k\} < 2 - n_k $$
$$ 1 - L \le -H(X) < 2 - L $$

or

$$ 1 + H(X) \le L < 2 + H(X) \tag{4-56} $$



The following steps summarize this method.

Step 1. Write down the messages in their specified order. (We
assume that some “alphabetic order” has been specified for the symbols.)

Step 2. Let $n_i$ be the length of the encoded symbol $x_i$; choose $n_i$ such
that

$$ 2^{1-n_i} \le P\{x_i\} < 2^{2-n_i} \qquad i = 1, 2, \ldots, N $$

Step 3. Compute the nondecreasing sequence $[\alpha_1, \alpha_2, \ldots, \alpha_N]$:

$$ \alpha_1 = \tfrac{1}{2} P\{x_1\} $$
$$ \alpha_2 = P\{x_1\} + \tfrac{1}{2} P\{x_2\} \tag{4-57} $$
$$ \alpha_i = P\{x_1\} + P\{x_2\} + \cdots + P\{x_{i-1}\} + \tfrac{1}{2} P\{x_i\} $$

(Note that $0 < \alpha_1 \le \alpha_2 \le \cdots \le 1$.)

Step 4. The encoding of the message $x_i$ is given by the binary
expansion of the number $\alpha_i$ to the $n_i$th place.

To prove that the coding possesses the prefix property, note that either
of the following two inequalities must be true for any two symbols
$x_i$ and $x_j$ ($i < j$):

(a) $P\{x_i\} \le P\{x_j\}$

(b) $P\{x_i\} > P\{x_j\}$

If (a) is valid, then $n_i \ge n_j$, but since

$$ \alpha_j \ge \alpha_i + \tfrac{1}{2} P\{x_i\} + \tfrac{1}{2} P\{x_j\} $$
$$ \alpha_j \ge \alpha_i + 2^{-n_j} \tag{4-58} $$

we find that the jth code word cannot be identical with the first $n_j$ places
of the ith code word. A similar conclusion can be reached if (b) is true.
Thus the code has the prefix property.

As an example, consider the first four letters of the English alphabet
and their corresponding probabilities:

[space, A, B, C]

[0.1859, 0.0642, 0.0127, 0.0218]

The corresponding $n_i$ and $\alpha_i$ are found to be

[4, 5, 8, 7] and [0.09295, 0.2180, 0.25645, 0.2737]

Thus the desired code is

Space: 0001
A:     00110
B:     01000001
C:     0100011
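
Steps 1 to 4 of the Gilbert-Moore method can be sketched as below. This is an illustrative sketch rather than the authors' program: the function name is hypothetical, and probabilities are passed as decimal strings so that exact fractions avoid rounding in the binary expansions.

```python
from fractions import Fraction


def gilbert_moore_code(probs):
    """Alphabetical (order-preserving) code words for probabilities in the given order."""
    probs = [Fraction(p) for p in probs]
    words = []
    cumulative = Fraction(0)                 # P{x_1} + ... + P{x_{i-1}}
    for p in probs:
        n = 1                                # Step 2: 2**(1-n) <= p < 2**(2-n)
        while Fraction(2) ** (1 - n) > p:
            n += 1
        alpha = cumulative + p / 2           # Step 3, Eq. (4-57)
        digits = []
        for _ in range(n):                   # Step 4: first n binary places
            alpha *= 2
            bit = int(alpha)
            digits.append(str(bit))
            alpha -= bit
        words.append("".join(digits))
        cumulative += p
    return words


words = gilbert_moore_code(["0.1859", "0.0642", "0.0127", "0.0218"])
print(words)  # ['0001', '00110', '01000001', '0100011']
```

The words come out in increasing binary order, which is the sense in which the encoding preserves the alphabetical order of the messages.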





A similar encoding procedure has been suggested by replacing $\alpha_i$ with $\beta_i$,

$$ \beta_i = \sum_{j=1}^{i-1} 2^{1-n_j} + 2^{-n_i} $$

and obtaining the first $n_i$ digits of the binary expansion of $\beta_i$. This
encoding procedure, which preserves the original message order in a
binary numbering order and has also the prefix property, is referred to
as an alphabetical encoding. The amount of computation for an
alphabetical encoding is very little, but the existing method for finding the
alphabetical encoding with the least average cost is rather complex.

One may wish to apply this latter procedure to the English alphabet
in its ordinary alphabetical order. The Gilbert-Moore answer to this
problem is given in Table 4-2. In the code listed in this table, word
lengths have been shortened to a minimum without losing the prefix
property. Such codes have been referred to as the best alphabetic codes.
The average length of the best alphabetic code can be made reasonably
close to the best possible average length obtained by Huffman’s technique.

4-11. Fundamental Theorem of Discrete Encoding in Presence of
Noise. In Sec. 4-8 we discussed the first fundamental theorem of
information theory. It was shown there that, for a given discrete noiseless
memoryless channel with capacity C and a given source (without
memory) with an entropy H, it is possible to devise proper encoding
procedures such that the encoded output of the source can be transmitted
through the channel with a rate as close to C as desired. In this section
we wish to extend the foregoing concept to cover the case of discrete
channels when independent noise affects each symbol. It will be shown
that the output of the source can be encoded in such a way that, when
transmitted over a noisy channel, the rate of transmission may approach
the channel capacity C with the probability of error as small as desired.
This statement is referred to as the second fundamental theorem of
information theory. Its full meaning will be minutely restated at the
end of this section, where a more analytic statement is derived.

Second Fundamental Encoding Theorem. Let C be the capacity of a
discrete channel without memory, R any desired rate of transmission of
information (R < C), and S a discrete independent source with a specified
entropy. It is possible to find an appropriate encoding procedure to
encode the output of S so that the encoded output can be transmitted
through the channel at the rate R and decoded with as small a probability
of error or equivocation as desired. Conversely, such a reliable
transmission for R > C is not possible.

From a mathematical standpoint, the proof of this fundamental
theorem and its converse is the central theme of information theory.
Subsequent to Shannon’s original statement of this crucial theorem, much
interest was stimulated toward producing a formal proof. After a
number of years of research, formal proof is now available for channels
without memory or with finite memory. It seems that further work
will be forthcoming in the periphery of this basic theorem. Among
those who have contributed considerably to the formalization of these
basic theorems are Barnard, Elias, Fano, Feinstein, Khinchin, McMillan,
Shannon, and Wolfowitz. The first complete proof for discrete noisy
channels is due to Feinstein. This proof is quite complex and requires a
number of preliminary mathematical lemmas. Feinstein’s proof
occupies more than two chapters of his book “Foundations of Information
Theory” (McGraw-Hill Book Company, Inc., New York, 1958).

The presentation of such extensive proof is beyond the scope of this
chapter. It is also questionable if the inclusion of such proof would be
decisively helpful to the reader who may not have an advanced
background in probability and a professional interest in information
theory. However, a heuristic proof for binary symmetric channels will
be given which will throw some light on the theorem while avoiding its
complicated mathematical details. Those interested in a formal proof
are referred to the original papers of the aforementioned contributors and
the material presented in Chap. 12.

Fig. 4-7. A BSC.

A Heuristic Proof of the Fundamental Theorem for BSC. Consider a
source S with a message ensemble $[A] = [a_1, a_2, \ldots, a_N]$. The source
is assumed to transmit any one of these N messages independently and
with equal probability. In other words, you may think of it as a source
which selects its signals completely “at random.” The channel is
specified to be a binary symmetric channel (BSC), as shown in Fig. 4-7.

$$ P\{0|0\} = P\{1|1\} = p $$
$$ P\{1|0\} = P\{0|1\} = 1 - p = q $$

The encoder must encode each of the N messages of [A] into a string of
0’s and 1’s. We assume that the encoded messages all have the same
length n binary digits.

Finally, since the noise will affect the signals, we must devise an
intelligent scheme for recognizing which input message was sent by
inspecting the noise-altered message received. Consider the following:
We have N source messages, each encoded into n binary digits. At the
receiver, we have a catalog containing all possible $2^n$ n-symbol sequences.
In the noisy transmission of an n-symbol sequence, the received sequence
(there are at most $2^n$ of them) may not agree with any of the N sequences
catalogued. In order to decide which of the N source sequences could
have been sent, let us simply choose the catalogued sequence that differs
from the received sequence by the least number of digits.

A geometric picture of the suggested code is given in Fig. 4-8.
Each dot represents one of the $2^n$ possible received messages. The
small squares indicate the N randomly selected source messages.
For a given n, the number N should not be too large so that the message
points can be somewhat evenly distributed in the message space with
adequate distance among them to “overcome the effect of noise.”
Loosely speaking, for large values of n, one should be able to spread the
message points so far apart that no other possible transmitted messages
appear in their vicinity (that is, within the circle of the figure).

Fig. 4-8. Each square stands for a transmitted word. The small dot represents
a received word not necessarily in the vocabulary. The circle illustrates a
primitive detection rule, that is, we decode the received message as any one of
the permissible words in the transmitter’s vocabulary which may fall in
this circle.

Now let us examine the reliability of the suggested encoding-decoding
procedure when n is rather large. Consider the sequence of n 1’s and
0’s as a sequence of independent Bernoulli trials in which (referring to
Fig. 4-7) noise-free transmission of each binary digit occurs with
probability p and erroneous transmission with probability 1 − p = q. Then
the probability of receiving exactly n − r correct digits (or, equivalently,
exactly r erroneous digits) is

$$ P\{r \text{ errors}\} = \binom{n}{n-r} p^{n-r} q^{r} \qquad \text{where} \qquad \binom{n}{n-r} = \frac{n!}{(n-r)!\, r!} \tag{4-59} $$


If we choose a random variable Z to denote the number of erroneous
digits in a received message, then Z assumes values k = 0, 1, 2, . . . , n
with probabilities

$$ P\{Z = k\} = \binom{n}{k} p^{n-k} q^{k} $$

and its average value is

$$ E(Z) = \bar{z} = \sum_{k=0}^{n} k\, P\{Z = k\} = \sum_{k=1}^{n} k \binom{n}{k} p^{n-k} q^{k} = nq \tag{4-60} $$
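
The error statistics of Eqs. (4-59) and (4-60), together with the decoding rule of choosing the catalogued sequence that differs from the received one in the fewest digits, can be tried out numerically. The following is a small sketch with hypothetical function names and a hypothetical two-word catalog; it is not taken from the text.

```python
from math import comb


def error_count_pmf(n, q):
    """P{Z = k} for k = 0..n erroneous digits on a BSC with error probability q."""
    p = 1 - q
    return [comb(n, k) * p ** (n - k) * q ** k for k in range(n + 1)]


def mean_errors(n, q):
    """Average number of erroneous digits; Eq. (4-60) states this equals n*q."""
    return sum(k * pk for k, pk in enumerate(error_count_pmf(n, q)))


def decode_min_distance(received, catalog):
    """Catalogued word differing from `received` in the fewest digit positions."""
    return min(catalog, key=lambda word: sum(a != b for a, b in zip(word, received)))


if __name__ == "__main__":
    print(mean_errors(20, 0.1))                              # close to 20 * 0.1
    print(decode_min_distance("10110", ["00000", "11111"]))  # nearest word
```

When two catalogued words are equally close, `min` returns the first of them; the text's rule leaves that choice open.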




In other words, in each sequence of length n binary digits, we can
expect, on an average, nq digits to be altered by noise. Therefore,
according to our decoding procedure, each catalogued sequence that
differs from the received sequence by nq digits or less could, on an average,
have been the sequence sent. The number of these sequences that can
be considered as possible original messages, on an average, is

$$ M = \sum_{k=0}^{nq} \binom{n}{k} = 1 + \binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{nq} $$

For q < 1/2, the sum of the last nq terms on the right side of this
equation is smaller than nq times the largest term, that is,

$$ M \le 1 + nq \binom{n}{nq} \tag{4-61} $$

When n is large, the factorial can be approximated by Stirling’s formula:

$$ n! \approx \sqrt{2\pi}\; n^{n+1/2} e^{-n} \tag{4-62} $$

$$ M \le 1 + nq\, \frac{\sqrt{2\pi}\; n^{n+1/2} e^{-n}}{(2\pi)\,(nq)^{nq+1/2} e^{-nq}\,(np)^{np+1/2} e^{-np}} \tag{4-63} $$

Collecting terms yields

$$ M \le 1 + \frac{1}{p^{np} q^{nq}} \sqrt{\frac{nq}{2\pi p}} \tag{4-64} $$
Of these M sequences which, according to our decoding scheme, can be
considered as possible original messages, only one is correct, and M − 1
are potential misinterpretations of the received signal.

Now we use the further assumption that the binary encoding
procedure is a random one. That is, for encoding any message, say $a_k$, we
flip an honest coin n times. We obtain a sequence

$$ a_k = \{HTTTHH \cdots T\} $$

Then each H is replaced by, say, a 0, and each T by a 1. Then

$$ a_k = \{011100 \cdots 1\} $$

There are $2^n$ possible sequences that can be so constructed. However,
there are only $N < 2^n$ messages in the message ensemble. Therefore,
it is “intuitively” clear that the probability that an n-digit sequence,
selected at random, corresponds to one of the N messages of [A] is $N/2^n$.
(See Shannon’s proof in Chap. 12.)





Similarly, of the M − 1 potential misinterpretations of the received
signal, on an average, only the fraction $N/2^n$ of them could correspond to one of the
N messages of [A]. Thus, the number of messages, other than the
correct original message, that could have been changed by noise into the
received signal, on an average, is

$$ (M - 1)\frac{N}{2^n} \le \frac{N}{2^n}\, \frac{1}{p^{np} q^{nq}} \sqrt{\frac{nq}{2\pi p}} \tag{4-65} $$

The quantity $(M - 1)N/2^n$ is indicative of the frequency of the occurrence of an
error. To see the relation between the inequality (4-65) and the rate of
transmission of information it is necessary to bring into consideration the
rate of transmission of a binary symmetric channel and its capacity.
Note that

$$ C = 1 + p \log p + q \log q \tag{4-66} $$

or

$$ 2^{C} = 2\, p^{p} q^{q} \qquad \text{and} \qquad 2^{nC} = 2^{n} p^{np} q^{nq} $$

Substituting in (4-65),

$$ (M - 1)\frac{N}{2^n} \le \frac{N}{2^n}\, \frac{2^n}{2^{nC}} \sqrt{\frac{nq}{2\pi p}} = \frac{N}{2^{nC}} \sqrt{\frac{nq}{2\pi p}} \tag{4-67} $$

Now, by conveniently choosing N, the number of messages of [A], to be
equal to or less than $2^{nC}/n$, the following simplification occurs:

$$ (M - 1)\frac{N}{2^n} \le \frac{1}{n} \sqrt{\frac{nq}{2\pi p}} = \sqrt{\frac{q}{2\pi p n}} \tag{4-68} $$

As the length of the encoded messages is increased, the number of the
original sequences (not sent) which could have been erroneously decoded
is diminished. When $n \to \infty$, then $(M - 1)N/2^n \to 0$ irrespective of the noise
characteristic (q < 1/2).

Finally, let us compute the entropy of the input to the channel when
N is conveniently chosen to equal $2^{nC}/n$. (Values of $2^{nC}$ for different n
and C are given in Fig. 4-9.)

$$ H_n(X) = \frac{\log N}{n} = \frac{\log 2^{nC}}{n} - \frac{\log n}{n} = C - \frac{\log n}{n} \tag{4-69} $$

As n is made larger and larger, $H_n(X)$ approaches the channel capacity C,
which also shows that the equivocation entropy approaches zero.
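
The two limiting statements of the heuristic proof can be illustrated numerically. The sketch below assumes the choice $N = 2^{nC}/n$, for which the average number of wrong candidates is bounded by $\sqrt{q/(2\pi p n)}$ and the input entropy per digit is $C - (\log n)/n$; the function names are hypothetical and logarithms are taken to base 2.

```python
from math import log2, pi, sqrt


def bsc_capacity(q):
    """C = 1 + p log p + q log q for a BSC with error probability 0 < q < 1/2."""
    p = 1 - q
    return 1 + p * log2(p) + q * log2(q)


def wrong_candidates_bound(n, q):
    """Bound sqrt(q / (2 pi p n)) on the average number of wrong candidates."""
    p = 1 - q
    return sqrt(q / (2 * pi * p * n))


def input_entropy_per_digit(n, q):
    """H_n(X) = C - (log n)/n when N = 2**(n C) / n messages are used."""
    return bsc_capacity(q) - log2(n) / n


if __name__ == "__main__":
    q = 0.1                                   # hypothetical noise level
    for n in (10, 100, 1000, 10000):
        # The bound shrinks toward 0 while the rate climbs toward C.
        print(n, wrong_candidates_bound(n, q),
              input_entropy_per_digit(n, q), bsc_capacity(q))
```

The rate approaches C only slowly, which reflects the remark that a higher rate is bought with longer code words.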

This completes the heuristic proof of the second fundamental theorem
of information theory for a BSC. The foregoing proof demonstrates
that, if the number of messages N is suitably selected, then by randomly
encoding messages in binary digits, we obtain a family of random codes
such that for at least one of these codes it is possible to approach the
transmission rate C when n is made larger and larger. The closer we
wish to get to this ideal rate, the more the length of the encoded messages
must be increased. This necessarily causes a delay in transmission and
reception. That is, a higher rate of transmission can be obtained at the
expense of a longer delay. The above simple proof for the second
fundamental theorem for BSC perhaps originated in several notes by scientists
at Bell Telephone Laboratories (for instance, C. E. Shannon and E. N.
Gilbert). Similar proofs were also given by G. A. Barnard and P. Elias.
For a complete proof in the more general case, at present, the
mathematical machinery suggested by McMillan, Feinstein, and Khinchin
would be appropriate. For the sake of reference, the statement of the
second fundamental theorem is given below, while its proof is deferred
until Chap. 12.

Fig. 4-9. Values of $2^{nC}$ computed for BSC of capacity C when transmitting words of
length n.

Consider a discrete channel without memory, with capacity C. Let

$$ 0 < H < C \qquad \epsilon > 0 $$

There exists a positive integer n depending on $\epsilon$ and H,

$$ n = f(\epsilon, H) $$

such that if we consider the transmission of N words of length n,

$$ N \ge 2^{nH} $$

then we are able to select the N transmitted symbols $u_1, u_2, \ldots, u_N$
such that at the receiver end we can associate with them N distinct
categories of words $B_1, B_2, \ldots, B_N$ with

$$ P\{v \in B_i \mid u_i\} \ge 1 - \epsilon \tag{4-70} $$

$\epsilon$ may be made as small as desired. Shannon’s original statement of this
most significant theorem in coding theory was given in two forms: one
stating the possibility of a transmission rate close to the channel capacity
with equivocation approaching zero and the other in terms of the
vanishingly small probability of error in selecting input signals from the output
signals. These two formulations, although not mathematically identical,
lead to equivalent results from an engineering point of view. Those
interested in mathematical developments are referred to Feinstein (I),
Khinchin, and Chap. 12.

The converse of the second fundamental theorem states the important
fact that, no matter how clever we are, it is impossible to devise a
reliable encoding leading to a transmission rate higher than the channel
capacity C. The complete mathematical proof for this important
statement of Shannon’s theorem was first derived by J. Wolfowitz (I).

In the following section we give some specific examples of encoding in
the presence of noise. It will be shown that it is possible to overcome the
effect of noise by some appropriate encoding procedures, if one is willing
to use more complex methods and longer blocks of encoding sequences.

At present the two fundamental theorems described in this chapter
form perhaps the most important aspect of information theory.
Although these theorems may not seem of a practical nature, they most
clearly exhibit the upper bound of accomplishment for communication
apparatus. This is perhaps the most interesting result and “the golden
fruit” of the theory, as has been pointed out by several writers.*

Unfortunately, the coding theory has not yet provided adequate
methods for reaching this ideal aim. Slepian† appropriately remarked
about this:

From the practical point of view, the fundamental theorem contains the golden
fruit of the theory. It promises us communication in the presence of noise of a
sort that was never dreamed possible before: perfect transmission at a reasonable
rate despite random perturbations completely outside our control. It is
somewhat disheartening to realize that today, ten years after the first statement of
this theorem, its content remains only a promise that we still do not know in detail
how to achieve these results for even the most simple non-trivial channel.

* For example, Robert Pierce, Frontispiece of PGIT, vol. IT-5, no. 2, June, 1959.
† D. Slepian, Coding Theory, Nuovo cimento, vol. 13, Suppl. 2, pp. 373-383, 1959.

4-12. Error-detecting and Error-correcting Codes. In the earlier
sections of this chapter, we have discussed a number of basic encoding
procedures for discrete independent sources connected to discrete
memoryless channels in the absence of noise. From the practical point
of view, it is of prime importance to devise encoding methods leading to a
reliable transmission of information in the presence of noise. Examples
of the need for such reliable transmission procedures are found in the
operation of automatic telephone systems, large-scale digital computers,
and in the new field of automata. In many applications of these types,
the transmission of information must be kept error-free at quite a high
level of reliability. Unfortunately, at present, there is no simple
encoding method, analogous to Huffman’s optimum coding, available for the
transmission of information through noisy channels. The existing
methods are generally complex and confined to binary channels with a
relatively low rate of information transmission.

The first complete error-detecting and error-correcting encoding
procedure was devised by Hamming in 1950. Hamming’s method represents
one of the simplest and most common encoding methods for the
transmission of information in the presence of noise. We assume that the
source transmits binary messages and that the channel is a binary
symmetric channel (Fig. 4-7). In a message which is n digits long, a number
of m < n digits are directly employed to convey the information and the
remaining k = n − m digits are used for the detection and correction of
error. The latter digits are called parity checks. Thus, in a certain sense,
one may say that the relative redundancy of the procedure is R = n/m.

Hamming’s single-error detecting code can be described as follows:
The first n − 1 digits of the message are information digits; in the nth
place we put either 0 or 1 so that the entire message has an even number
of 1’s. This is called an even parity check procedure. Evidently one
can as well use an odd parity check, or one may wish to place the parity
check at some other specified position. Examples of even and odd parity
checks are given below:

Messages    Messages with even parity checks    Messages with odd parity checks 
                              P                                   P 
100101      1001011           1                 1001010           0 
010010      0100100           0                 0100101           1 
101100      1011001           1                 1011000           0 


When a single parity check is used, if a single error occurs in a received 
message it will immediately be detected, although the position of the 





erroneous digit will not be determined. For example, with an even 
parity check, if we receive a message such as 1101011, we detect an error, 
indicating that an odd number of digits has been transmitted in error. 
However, we have no specific knowledge of the position or the number of 
the errors. In the preceding we have tacitly assumed that the parity 
check was received without any error. This of course in itself may not 
be correct. Nonetheless the parity check improves the reliability of the 
transmission; that is, it increases the probability of the detection of 
error. 
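
The single parity check described above can be sketched in a few lines of Python. This is only an illustration: the function names `add_parity` and `passes_check` are assumptions introduced here, and the parity digit is placed in the last position as in the examples of the text.

```python
def add_parity(bits: str, even: bool = True) -> str:
    """Append one parity digit so that the number of 1's is even (or odd)."""
    ones = bits.count("1")
    parity = ones % 2 if even else (ones + 1) % 2
    return bits + str(parity)


def passes_check(word: str, even: bool = True) -> bool:
    """Return True when the received word satisfies the chosen parity rule."""
    return (word.count("1") % 2 == 0) == even


# The three messages of the text, with even and odd parity checks appended.
for message in ("100101", "010010", "101100"):
    print(message, add_parity(message), add_parity(message, even=False))

# The received word 1101011 contains an odd number of 1's,
# so it fails the even parity check and an error is detected.
print(passes_check("1101011"))
```

Note that a word with two digits in error passes the check again, which is why a single parity check detects only an odd number of errors.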

Hamming has also developed an error-correcting scheme which will 
not be presented here in detail, except for the case of a single error. 

Single-error Detection and Correction. Our problem is to devise a 
method capable of: 

1. Revealing the occurrence of a single error in any binary message 
block n digits long 

2. Detecting the position of the erroneous digit 

In an n-digit message, it is assumed, either no error or a single error 
occurs. But if the error occurs it may be in any one of the n possible 
positions. This procedure will be discussed in Sec. 4-13. 

4-13. Geometry of the Binary Code Space. Consider all encoded 
messages having n digits and constructed as sequences of letters taken 
from an alphabet of D letters. Each encoded message can be considered 
as a point in an n-dimensional space. If for convenience some arbitrary 
numbers are associated with these D letters, then each point of the code 
space will have real coordinates. For example, when using a binary 
alphabet, we are led to points in the n-dimensional space with every 
coordinate being either 0 or 1. Such a geometric model has a certain 
natural appeal for discussing binary encoding problems. This model was 
initially employed by Hamming in 1950 and since then has found 
considerable use in connection with binary coding. 

Let $U = [\alpha_1, \alpha_2, \ldots, \alpha_n]$ be a binary word. That is, $\alpha_k = 0$ or 1. 
We may define the distance between two points $U$ and $V = [\beta_1, \beta_2, \ldots, \beta_n]$ 
as the number of coordinates for which $\alpha_k$ and $\beta_k$ are different. 

$$D(U,V) = \sum_{k=1}^{n} (\alpha_k \oplus \beta_k) \qquad (4\text{-}71)^*$$

To justify the use of the word "distance," the validity of certain 
mathematical properties of $D(U,V)$ should be examined: these are 

* The notation $\oplus$ here implies 

$1 \oplus 0 = 1$ 
$0 \oplus 1 = 1$ 
$0 \oplus 0 = 0$ 
$1 \oplus 1 = 0$ 




$$D(U,V) = 0 \text{ if, and only if, } U = V$$
$$D(U,V) = D(V,U) \ge 0 \qquad (4\text{-}72)$$
$$D(U,W) + D(V,W) \ge D(U,V)$$

The validity of these properties is self-evident. As an example, note that 

U = 1001        D(U,V) = 4 
V = 0110        D(U,W) = 1 
W = 1000        D(V,W) = 3 

1 + 3 = 4 

Now that the distance has been defined, we are in a position to define 
a sphere of radius r and centered at point U as the locus of all points in the 
code space that are at a distance r from U. Thus, in the above example, 
point V is on a sphere with center at W and radius 3. 
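
As a small check of the definition of distance and of the numerical example above, the following Python sketch counts differing coordinates and lists the points of a sphere. The names `distance` and `sphere` are illustrative assumptions, and words are represented as strings of 0's and 1's.

```python
from itertools import product


def distance(u: str, v: str) -> int:
    """Number of coordinates in which two binary words of equal length differ."""
    if len(u) != len(v):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(u, v))


def sphere(center: str, r: int) -> list[str]:
    """All binary words at distance exactly r from the center."""
    n = len(center)
    words = ("".join(w) for w in product("01", repeat=n))
    return [w for w in words if distance(w, center) == r]


U, V, W = "1001", "0110", "1000"
print(distance(U, V), distance(U, W), distance(V, W))

# Triangle property: D(U,W) + D(V,W) >= D(U,V)
print(distance(U, W) + distance(V, W) >= distance(U, V))

# V lies on the sphere of radius 3 centered at W.
print(V in sphere(W, 3))
```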

Now suppose that, at the input of the channel, all encoded words are at 
a distance of at least 2 from each other. If, because of an error in 
transmission, one single error occurs, then a word will be erroneously received 
as a meaningless word, that is, a word that does not exist in the 
transmission vocabulary. Thus, in such a setup any single error is detectable. 
If the minimum distance between points representing code words is 
taken to be 3 units and a single error occurs in the transmission of a word, 
then the point representing the received word will be 1 unit apart from 
the point representing the correct word. The correct word can be 
identified by finding the closest permissible word to the one received. 
Such schemes can be used for single-error detection and correction. The 
following data are given by Hamming: 

Minimum required distance 
between every two coded 
words                        Description of the coding 

1                            Error cannot be detected 
2                            A single error can be detected 
3                            A single error can be corrected 
4                            A single error can be corrected 
                             plus double error detected 
5                            Double-error correction 
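
The rule of "finding the closest permissible word" can be illustrated with a toy vocabulary. The sketch below is an assumption-laden example, not Hamming's construction: it uses the two-word vocabulary 000 and 111 (minimum distance 3) and the hypothetical helper names `min_distance` and `nearest_word`.

```python
from itertools import combinations


def distance(u: str, v: str) -> int:
    """Number of coordinates in which two binary words differ."""
    return sum(a != b for a, b in zip(u, v))


def min_distance(code: list[str]) -> int:
    """Smallest distance between any two distinct words of the vocabulary."""
    return min(distance(u, v) for u, v in combinations(code, 2))


def nearest_word(received: str, code: list[str]) -> str:
    """Decode by choosing the permissible word closest to the received one."""
    return min(code, key=lambda c: distance(c, received))


code = ["000", "111"]            # toy vocabulary with minimum distance 3
print(min_distance(code))        # distance 3 allows single-error correction
print(nearest_word("010", code)) # a single error in 000 is corrected
print(nearest_word("110", code)) # a single error in 111 is corrected
```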

Having the concept of distance in mind, we now can raise the following 
question. How many code words at most can be included in a vocabulary 
containing only n-digit words subject to a single-error detection? Or 
alternatively, what is the largest number of vertices in a unit n-dimensional 
cube such that no two points are closer than 2 units from each 
other? A code book (n,d) containing the greatest number of words for 
a specified number of error detections and corrections is referred to as an 
optimal code. The Hamming codes (n,d) are generally referred to as 
systematic codes. 

First consider the problem for n = 2. As shown in Fig. 4-10, there are 
four points in this space with the following coordinates: 

$U_1 = [0 \; 1] \qquad U_2 = [1 \; 1] \qquad U_3 = [0 \; 0] \qquad U_4 = [1 \; 0]$ 

Thus 
$$D(U_1,U_2) = D(U_1,U_3) = D(U_2,U_4) = D(U_3,U_4) = 1$$
$$D(U_1,U_4) = D(U_2,U_3) = 2$$

There are two such sets of points with a distance of 2 units from each 
other. In other words, it is observed that a two-dimensional cube has 
$2^2$ vertices, among which at most $2^1$ points are 2 units apart. 

For n = 3 we note (Fig. 4-11) that the points are the vertices of two 
distinct squares. Obviously, the two sets $(U_1, U_4, V_2, V_3)$ and 
$(U_2, U_3, V_1, V_4)$ satisfy the above distance requirements; that is, the 
distance among points of each set is at least 2 units. 

$U_1 = 010 \qquad V_1 = 011$ 
$U_2 = 110 \qquad V_2 = 111$ 
$U_3 = 000 \qquad V_3 = 001$ 
$U_4 = 100 \qquad V_4 = 101$ 

Fig. 4-10. Four points with a minimum mutual distance of 1 unit. 

Fig. 4-11. Vertices of a unit cube in three-dimensional space. 


A three-dimensional cube has $2^3$ vertices, among which at most $2^2$ points 
are 2 units apart. This reasoning could be extended without any 
difficulty. We shall conclude that an n-dimensional cube has $2^n$ vertices, 
which can be considered the vertices of two distinct (n - 1)-dimensional 
cubes. Among all these vertices there are at most $2^{n-1}$ points which are 
2 units apart. Therefore, for single-error detecting schemes with code 
words each n symbols long, we can have at most $2^{n-1}$ words. The result 
may be summarized by writing 

$$B(n,d) = B(n,2) = 2^{n-1} \qquad (4\text{-}73)$$

$B(n,d)$ being the upper bound for the number of code words of length n 
and a minimum mutual distance of d units (here d = 2). 
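
The bound of Eq. (4-73) is attained by the words carrying an even parity check, and this can be verified by brute force for small n. The sketch below is illustrative only; `even_weight_words` and `min_distance` are names assumed here.

```python
from itertools import combinations, product


def even_weight_words(n: int) -> list[str]:
    """All n-digit binary words containing an even number of 1's."""
    return ["".join(w) for w in product("01", repeat=n) if w.count("1") % 2 == 0]


def min_distance(code: list[str]) -> int:
    """Smallest number of differing coordinates between two distinct words."""
    return min(
        sum(a != b for a, b in zip(u, v)) for u, v in combinations(code, 2)
    )


for n in range(2, 7):
    words = even_weight_words(n)
    # Expect 2**(n-1) words, no two closer than 2 units.
    print(n, len(words), min_distance(words))
```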

The following interesting results have been obtained by Hamming and 
will be quoted without further proof. In the light of the above 
discussion, the reader may wish to prove them as an exercise. 

$$B(n,1) = 2^n$$
$$B(n,2) = 2^{n-1}$$
$$B(n,3) = 2^m \le \frac{2^n}{n+1} \qquad (4\text{-}74)$$
$$B(n,4) = 2^m \le \frac{2^{n-1}}{n}$$
$$B(n-1, 2k-1) = B(n, 2k)$$

4-14. Hamming's Single-error Correcting Code. In this section we 
discuss Hamming's code for single-error detection as well as correction. 
The method demonstrates how one can improve the reliability of the 
transmission of information in the presence of noise. In order to correct 
a single error in any one of the n positions, we need n + 1 independent 
"pieces of information." (One piece of information is required to show 
that no error has occurred.) With k parity checks, it is possible to have 
at most $2^k$ distinct parity words. If a one-to-one correspondence among 
the parity words and error locations is to be established, we must require 
that 

$$2^k \ge n + 1 \qquad (4\text{-}75)$$


As for the transmission procedure, in lieu of transmitting a word 
$x_1 x_2 \cdots x_m$ we compute its corresponding parity-check word 
$x_{m+1} \cdots x_n$ and transmit the word $x_1 x_2 \cdots x_n$. (The parity checks must be such 
that distinct m words have distinct parity words.) Then, at the receiver, 
we must devise a technique for determining the position of any possible 
single error, or no error. The method can be illustrated in terms of an 
example. 

Suppose that we wish to devise a single-error detecting and correcting 
code for blocks of four binary digits. The smallest number of the 
required parity checks k is given by 

$$2^k \ge n + 1 = 5$$
$$k \ge \log 5$$

In order to be able to transmit four information digits we need to have 
at least three parity digits in each block. In fact, let $x_i$ denote the digit 
in the ith position in a sequence of seven digits. The parity checks 
$[x_5, x_6, x_7]$ may be derived from the modulo 2 equations: 

$$x_1 + x_2 + x_3 + x_5 = \text{even} \qquad s_1$$
$$x_1 + x_2 + x_4 + x_6 = \text{even} \qquad s_2 \qquad (4\text{-}76)$$
$$x_1 + x_3 + x_4 + x_7 = \text{even} \qquad s_3$$

For any received word, the truth set (validity) of these equations can be 
exhibited by the three sets of Fig. 4-12. 



Fig. 4-12. A set-theoretic approach for deriving Hamming's single-error correcting 
equations among information and parity checks. 


There are seven disjoint sets in this figure, and each is associated with 
only one variable $x_i$. The variable $x_1$ belongs to the common intersection 
$s_1 s_2 s_3$. The variable $x_2$ belongs to the part of $s_1 s_2$ lying outside $s_3$, and so on. 

If only $s_1$ fails, then $x_5$ is incorrect. 

If only $s_2$ fails, then $x_6$ is incorrect. 

If only $s_3$ fails, then $x_7$ is incorrect. 

If only $s_1$ and $s_2$ fail, then $x_2$ is incorrect. 

If only $s_1$ and $s_3$ fail, then $x_3$ is incorrect. 

If only $s_2$ and $s_3$ fail, then $x_4$ is incorrect. 

If $s_1$, $s_2$, and $s_3$ fail, then $x_1$ is incorrect. 

If $s_1$, $s_2$, and $s_3$ are valid, then there is no error. 

Thus if we assume that not more than a single error may occur in blocks 
of seven digits, with this method we shall be able to correct all such 
possible errors. For example, if 1010110 is received, $x_5$ is in error. 

The validity of the foregoing method is based on the fact that the 
suggested three sets embody seven disjoint subsets, each subset being 
in a one-to-one correspondence with a logical proposition concerning the 
validity or the failure of sets $s_1$, $s_2$, and $s_3$. Hence, a one-to-one 
correspondence between the seven variables and the corresponding seven 
logical propositions may be established. 

Next, we may generalize the above procedure and suggest a logical 
method for writing the required number of modulo 2 equations. For an 
(n,k) code, the following steps in the selection of the appropriate terms 
of the basic k equations seem self-explanatory. 

Step 1. Denote by $s_i$ the logical proposition of the validity of the ith 
equation (i = 1, 2, . . . , k) and include $x_{m+i}$ only in the ith equation. 
These are the totality of the parity checks. 

Step 2. Include $x_1$ in all k equations. 

Step 3. Include each of the next $\binom{k}{k-1} = k$ variables, that is, 
$x_2, x_3, \ldots, x_{k+1}$, in k - 1 equations [as they occur in a general set-theoretic 
diagram; that is, each equation contains 
$1 + \binom{k-1}{1} + \binom{k-1}{2} + \cdots + \binom{k-1}{k-1}$ variables]. 

Step 4. Include each of the next $\binom{k}{k-2} = \frac{1}{2}k(k-1)$ variables, 
that is, $x_{k+2}, \ldots$, in k - 2 equations (as they appear in a general 
set-theoretic diagram). 

Step 5. Continue this method until the last information digit $x_m$ is 
included. 

The k equations obtained in this manner constitute the main rule for 
encoding and decoding. For encoding, one computes the parity checks 
to be transmitted along with any given information sequence. For 
decoding, one can consider the sequence $s = s_1 s_2 \cdots s_k$, where each 
individual term $s_i$ assumes the value 0 if the ith equation is valid for the 
received message and the value 1 otherwise. There are $2^k$ distinct 
possibilities for the s sequence, and as long as $2^k \ge n + 1$ we are able to make a 
one-to-one correspondence between every variable $x_j$ and a distinct 
s sequence. As examples of the applications of the foregoing rules, 
we state the results for the following two cases: 


Case A     m = 1     k = 2     n = 3 

$$x_1 + x_2 = 0$$
$$x_1 + x_3 = 0$$

Case B     m = 11     k = 4     n = 15 

$$x_1 + x_2 + x_3 + x_4 + x_6 + x_7 + x_8 + x_{12} = 0$$
$$x_1 + x_2 + x_3 + x_5 + x_6 + x_9 + x_{10} + x_{13} = 0$$
$$x_1 + x_2 + x_4 + x_5 + x_7 + x_9 + x_{11} + x_{14} = 0$$
$$x_1 + x_3 + x_4 + x_5 + x_8 + x_{10} + x_{11} + x_{15} = 0$$

In the practical application of single-error correcting codes, the follow- 
ing method of message numbering makes computation quite simple. 
This method* is based on an appropriate bookkeeping procedure sug- 
gested by Hamming. 





Step 1. Number the messages to be transmitted as 0, 1, 2, 3, . . . 
and choose for the message $M_j$ the binary expression of number j. Since 
all messages must contain m information digits, add zeros to the left of 
the binary number j, if necessary. 

Step 2. Number the positions in all n words from left to right. 

Step 3. Assign $[x_1, x_2, x_4, x_8, \ldots]$ to check positions given by 

$$x_1 + x_3 + x_5 + x_7 + x_9 + x_{11} + x_{13} + \cdots = 0 \qquad s_1$$
$$x_2 + x_3 + x_6 + x_7 + x_{10} + x_{11} + x_{14} + \cdots = 0 \qquad s_2$$
$$x_4 + x_5 + x_6 + x_7 + x_{12} + x_{13} + x_{14} + \cdots = 0 \qquad s_3$$

(Include only those position numbers containing 
a 1 in their ith digits) = 0 $\qquad s_i$ 

Step 4. The selection of parities should be according to step 3. When 
a message is received, check the equations in step 3. When equation i 
is valid for the received messages, let $s_i = 0$; otherwise, $s_i = 1$. Next, 
compute the binary number 

$$I = \cdots s_2 s_1$$

The suggested method (Hamming) indicates that the digit in the Ith 
position (step 2) must have been in error. 

For example, when the method is applied to 16 messages (m = 4, k = 3), 
if 0101101 is received, we find I = 100; therefore $x_4$ is in error. The 
proof of the validity of the method is based on two facts: (1), the 
selection of independent equations or, what amounts to the same thing, 
the establishment of a one-to-one correspondence between n + 1 positions 
and distinct subsets of k sets, as described before; and (2), the proper 
ordering and assignment of binary numbers relevant to the chosen order- 
ing system. The reader may wish to consult Hamming's paper or 
“Logical Design of Digital Computers," by M. Phister, Jr. (John 
Wiley & Sons, Inc., New York, 1959). 
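
Because equation $s_i$ collects exactly the positions whose binary number has a 1 in the ith digit, the number I can be obtained as the exclusive-or of the position numbers holding a 1. The sketch below rests on that observation; the function names are assumptions, and placing the information digits in the non-check positions from left to right is an illustrative choice rather than something stated in the text.

```python
def error_position(word: str) -> int:
    """Binary number I = ... s2 s1 as an integer; 0 means every equation holds."""
    position = 0
    for j, digit in enumerate(word, start=1):
        if digit == "1":
            position ^= j  # bit i of j tells whether x_j enters equation s_i
    return position


def correct(word: str) -> str:
    """Flip the digit in the position indicated by I (if I is not zero)."""
    i = error_position(word)
    if i == 0:
        return word
    flipped = "1" if word[i - 1] == "0" else "0"
    return word[: i - 1] + flipped + word[i:]


def encode(info: str, n: int = 7) -> str:
    """Place information digits in non-check positions and set checks 1, 2, 4, ..."""
    word = [0] * (n + 1)              # index 0 is unused
    digits = iter(info)
    for j in range(1, n + 1):
        if j & (j - 1):               # j is not a power of two
            word[j] = int(next(digits))
    syndrome = 0
    for j in range(1, n + 1):
        if word[j]:
            syndrome ^= j
    for i in range(n.bit_length()):
        if (syndrome >> i) & 1:
            word[1 << i] = 1          # choose each check so its equation holds
    return "".join(str(d) for d in word[1:])


print(error_position("0101101"))  # I = 100 in binary, i.e. position 4
print(correct("0101101"))
print(encode("0101"))
```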

However, the constraint of a fixed number of errors in a block is not a 
practical one. In a BSC, let p be the probability of error in the 
transmission of a digit. Then the probability of receiving an incorrect word 
n digits long when $np \ll 1$ is 

$$\binom{n}{1} p(1-p)^{n-1} + \binom{n}{2} p^2 (1-p)^{n-2} + \cdots + p^n = 1 - (1-p)^n$$
$$= np - \tfrac{1}{2} n(n-1)p^2 + \cdots$$
$$\approx np \qquad (4\text{-}77)$$

The probability of an incorrect word after applying a single-error 
correcting scheme is 

$$\binom{n}{2} p^2 (1-p)^{n-2} + \cdots + p^n = 1 - (1-p)^n - np(1-p)^{n-1}$$
$$= \tfrac{1}{2} n(n-1)p^2 + \cdots \qquad (4\text{-}78)$$

Fig. 4-13. To the left of the curve, the probability of erroneous decoding for Hamming's 
SEC code is larger than when no encoding is used, and, to the right, the probability is 
smaller. (Vertical axis: m.) 

Thus the probability of decoding an incorrect word will be reduced from 
np to approximately $\tfrac{1}{2} n(n-1)p^2$. (For example, if p = 1/100, without 
any corrective measure, the probability of decoding an incorrect word of 
seven digits is 0.07.) When Hamming's single-error correcting code is 
applied, the probability of an incorrect word is often reduced. The ratio 
of the two probabilities gives an indication of the improvement brought 
about by the encoding scheme. This ratio is referred to as the figure 
of merit of the code. Thus the figure of merit for a single-error 
correcting code is 


$$\frac{1 - (1-p)^m}{1 - (1-p)^n - np(1-p)^{n-1}} \qquad (4\text{-}79)$$
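
A short numerical sketch of Eqs. (4-77) to (4-79) follows. It assumes, as the figure caption suggests, that the uncoded word has m digits while the encoded word has n digits; the function names are introduced here only for illustration.

```python
def uncoded_word_error(p: float, m: int) -> float:
    """Probability that at least one of m uncoded digits is in error."""
    return 1.0 - (1.0 - p) ** m


def sec_word_error(p: float, n: int) -> float:
    """Probability of two or more errors in n digits (single errors corrected)."""
    return 1.0 - (1.0 - p) ** n - n * p * (1.0 - p) ** (n - 1)


def figure_of_merit(p: float, m: int, n: int) -> float:
    """Ratio of the uncoded to the encoded word-error probability."""
    return uncoded_word_error(p, m) / sec_word_error(p, n)


p = 0.01
print(uncoded_word_error(p, 7))   # close to np = 0.07
print(sec_word_error(p, 7))       # close to n(n-1)p^2/2 = 0.0021
print(figure_of_merit(p, 4, 7))   # larger than 1: the code helps here
```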


The probability of receiving an incorrect block after single-error 
correction can easily be computed for different n and m = n - k. Let 
N be the number of messages to be encoded. The number of information 
digits m is the smallest integer not less than log N; note that 



(4-80) 




The following table and the graph of Fig. 4-13 were obtained from the 
reference given in the footnote.* 


N     4     8     16     32     64     128     256     512     1,024 
m     2     3     4      5      6      7       8       9       10 
n     5     6     7      9      10     11      12      13      14 

The region to the right of the curve in Fig. 4-13 is the region where the 
probability of error becomes lower; thus Hamming's code can be 
successfully applied. Note that this region corresponds to the cases most 
frequently encountered in practice. 

4-15. Elias's Iteration Technique. The Shannon-Feinstein fundamental 
theorem proves the existence of coding procedures allowing transmission 
of information at a rate less than or equal to the channel capacity in 
the presence of noise. However, no highly effective encoding procedures 
similar to Huffman's technique for noiseless channels are yet known. 
Most of the encoding techniques thus far discovered are rather complex, 
yet they do not always permit transmission at a rate as close to the 
channel capacity as desired. Among the existing codes, the 
error-correcting codes discussed earlier are perhaps the least complex ones. 
The iteration method suggested by P. Elias for binary symmetric erasure 
channels, as given below, is a good illustration of the application of 
error-correcting techniques (see Elias [II]). 

Consider a BEC as illustrated in Fig. 4-14. The input sequence of 
0's and 1's is divided into blocks each $N_1 - 1$ digits long. To each such 
block we add a parity check, say to the $N_1$th place. The parity check 
will be selected 0 or 1 so that the total number of 1's in each block becomes 
an even number (even parity check). 

$i_1, i_2, \ldots, i_{N_1-1}, c_1$ 

[0, 0, . . . , 1, 1] 

The average number of erasure digits in a block is 

$$E(z) = \sum_{i=0}^{N_1} i \binom{N_1}{i} q^i p^{N_1-i} = N_1 q \qquad (4\text{-}81)$$

* G. A. Shastova, Radiotekh. i Elektron., vol. 3, no. 1, pp. 19-26, 1958. 

Since with a single parity check any single error will be detected and 
corrected, the average number of erasures in blocks of $N_1$ digits is 

$$E(z) - E(z_1) \qquad (4\text{-}82)$$

where $E(z_1)$ is the average number of single erasures in the block, i.e., 

$$E(z_1) = N_1 q p^{N_1-1} \qquad (4\text{-}83)$$

Equation (4-83) yields 

$$E(z) - E(z_1) = N_1 q - N_1 q p^{N_1-1} = N_1 q (1 - p^{N_1-1}) \qquad (4\text{-}84)$$

A simple upper bound for the average number of remaining erasures is 

$$E(z) - E(z_1) \le N_1 q [1 - (1-q)^{N_1}] \le N_1 q (1 - 1 + N_1 q) = (N_1 q)^2 \qquad (4\text{-}85)$$

For example, assume that the following blocks ($N_1 = 10$) were received: 

1 0 0 1 0 1 0 x 0     1 
0 1 x x 0 1 0 1 0     1 
1 1 1 0 0 1 1 1 0     x 

In the first and the third block the erasure must have been a 0, while the 
originals of the double erasures in the second block remain unidentified. 
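
The way a single erasure is restored from an even parity check can be sketched as follows. Erasures are written as the letter x, as in the blocks above; the function name `fill_single_erasure` is an assumption of this example.

```python
def fill_single_erasure(block: str) -> str:
    """Restore a lone erased digit ('x') in a block carrying even parity.

    Blocks with no erasure or with more than one erasure are returned unchanged.
    """
    if block.count("x") != 1:
        return block
    # The erased digit must make the total number of 1's even.
    missing = block.count("1") % 2
    return block.replace("x", str(missing))


received = ["1001010x01", "01xx010101", "111001110x"]
for block in received:
    print(block, "->", fill_single_erasure(block))
```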

By means of this technique the average number of erasures in each 
block is reduced from $N_1 q$ to not more than $(N_1 q)^2$. It must be observed, 
however, that the rate at which the information is supplied to the channel 
is in the meantime reduced by a factor of $(N_1 - 1)/N_1$. Elias's iteration 
technique suggests the transmission of $N_2 - 1$ blocks of the above type. 
The $N_2$th block is a parity block in which each digit is a parity check for 
the digits above it. 


                            check 
                            digit 
0 1 0 . . .                   1 
1 0 0 . . .                   0          information digits 
1 1 0 . . .                   1 

1 0 1 . . .                   0          check digits 


The matrix is partitioned into information digits and parity digits. The 
rows are transmitted in order. Thus, all rows with a single erasure are 
properly decoded. Since the information digits are statistically inde- 
pendent, when the last row of the matrix is received, we shall be able 
properly to decode each column that contains only a single erasure. 





Define $q_1$ and $q_2$ as the average probability of erasure after correction of 
rows and columns, respectively. Then, the following relations hold: 

$$N_1 q_1 = N_1 q (1 - p^{N_1-1}) \quad \text{with } q_1 \le N_1 q^2 \qquad (4\text{-}86)$$
$$N_2 q_2 = N_2 q_1 (1 - p_1^{N_2-1}) \quad \text{with } q_2 \le N_2 q_1^2 \le N_2 N_1^2 q^4 \qquad (4\text{-}87)$$

where $p_1 = 1 - q_1$. 

For example, if $N_1 = N_2 = 10$ and q = [illegible], the average erasure 
probability after matrix block correction is less than 2.5 per cent, roughly a 
fivefold improvement as far as the erasure probability is concerned. The 
iteration process can, of course, be continued. We may use $N_3$ matrix 
blocks in which the last block is a check. The new erasure probability 
for the remaining digits after the described corrections is 

$$N_3 q_3 = N_3 q_2 (1 - p_2^{N_3-1})$$
$$q_3 \le N_3 q_2^2 \le N_3 N_2^2 q_1^4 \le N_3 N_2^2 N_1^4 q^8 \qquad (4\text{-}88)$$

As the iteration process is continued, the remaining erasure probability 
becomes smaller and smaller. The rate of transmission in bits per 
symbol is meanwhile decreased. However, the decrease of the rate is very 
slow. For instance, for a particularly simple computation, let, with 
Elias, $N_k = 2N_{k-1}$ for k = 1, 2, . . . ; then 

$$q_k \le 2N_{k-1} q_{k-1}^2$$
$$q_k \le 2^{k-1} N_1 q_{k-1}^2 \qquad (4\text{-}89)$$

For q = 1/20 and $N_1 = 10$ we have 

$$q_1 \le \tfrac{1}{10} 2^{-2}$$
$$q_2 \le \tfrac{1}{10} 2^{-3} \qquad (4\text{-}90)$$
$$q_3 \le \tfrac{1}{10} 2^{-4}$$

The rate at which the information is supplied to the channel, assuming 
equiprobable and statistically independent digits, is the ratio of the 
information to the total number of digits, that is, 

$$R = \frac{N_1 - 1}{N_1} \cdot \frac{N_2 - 1}{N_2} \cdot \frac{N_3 - 1}{N_3} \cdots$$
$$R > 1 - \tfrac{1}{10}(1 + 2^{-1} + 2^{-2} + \cdots)$$
$$R > 0.80$$

Note that with this procedure we have been able to reduce successfully 
the probability of erasure, while the rate of transmission of information 
has moderately decreased. In fact, as was pointed out in Chap. 3, the 
capacity of the BEC is 


C = 1 - q = 0.95 





We have not succeeded in reducing the error probability while 
approaching the maximum rate of transmission. Such an achievement 
would require much more elaborate encoding schemes. It is also to be 
noted that the lowering of the remaining erasure probability is accom- 
panied by a greater delay required between the time of transmission and 
the time of decoding a digit. For a relation between the delay and the 
error probability see Elias [II]. 

Example 4-9. Elias's Block Coding. Consider a binary erasure channel with 
erasure probability q = 1/16. The source has a number of messages which are encoded 
in binary digits such that in a long sequence of messages 0 and 1 appear independently 
and with equal probabilities. Each information digit is transmitted twice in order 
to combat the effect of noise. 

(a) Determine the rate of the input information. 

(b) Determine the average fraction of information-digit erasure. 

(c) Determine the rate of transmission of information in the channel. 

(d) In order to improve the transinformation rate, we use Elias's block coding. 
More specifically, we transmit message matrices of the form 

$i_1 \quad i_2 \quad i_3 \quad c_1$ 
$i_4 \quad i_5 \quad i_6 \quad c_2$ 
$i_7 \quad i_8 \quad i_9 \quad c_3$ 
$c_4 \quad c_5 \quad c_6 \quad c_7$ 

$c_1$, $c_2$, and $c_3$ are parity checks on the first, second, and third rows, 
respectively; $c_4$, $c_5$, and $c_6$ are parity checks on the first, second, and 
third columns. $c_7$ is a parity check on the 
parity checks above it. The remaining digits are information digits. Repeat parts 
(a), (b), and (c) for this encoding procedure. 

Solution 

(a) Repetition of each information digit will reduce by half the rate at which the 
information is supplied. 

$1 \times \tfrac{1}{2} = \tfrac{1}{2}$ bit per symbol 

(b) We assume that the input is supplied in groups of two digits; in each group one 
digit acts as a parity check on the other (information digit). Therefore, if we receive 
a pair 1x or x0, we know that the original pairs were 11 and 00, respectively. Thus, 
single errors are corrected. The probability of double erasure is $q \cdot q = 1/256$. 

(c) The equivalent channel of Fig. E4-9a is self-explanatory. For simplicity, 
think of 00 and 11 as two equiprobable input messages to the channel; thus 

$$H(X) = 1$$
$$H(X|Y) = -q^2 \log \tfrac{1}{2} - q^2 \log \tfrac{1}{2} = 2q^2$$
$$R = I(X;Y) = H(X) - H(X|Y) = 1 - 2q^2$$





But this rate is accomplished for the equivalent channel. However, in our original 
scheme only one digit carried information. Thus the rate in question is 

$$\tfrac{1}{2}(1 - 2q^2)$$


(d) Each letter in the message matrix is equiprobable; thus the average input rate 
for each information digit is 9/16 bit per symbol. The probability of the occurrence 
of more than a single erasure can be calculated as follows: The average number of 
erasures in blocks of N digits is Nq. The average number of single erasures is 

$$Nqp^{N-1}$$

Thus the desired average is 

$$Nq - Nqp^{N-1} = Nq(1 - p^{N-1}) = \tfrac{1}{4}\left[1 - \left(\tfrac{15}{16}\right)^3\right]$$

The average number of rows with an erasure (after a single correction) is 
$\tfrac{1}{4}[1 - (\tfrac{15}{16})^3]$. Define $q_1$, the average erasure probability after 
correction of the rows, as 

$$q_1 = \tfrac{1}{16}\left[1 - \left(\tfrac{15}{16}\right)^3\right]$$

The average erasure probability after checking by rows and columns is given by 

$$Nq_2 = Nq_1(1 - p_1^{N-1}) = \tfrac{1}{4}\left[1 - \left(\tfrac{15}{16}\right)^3\right](1 - p_1^3)$$

4-16. A Mathematical Proof of the Fundamental Theorem of Information 
Theory for Discrete BSC.* The material of this section is a supplement 
to Sec. 4-11. Having the notation of that section in mind, let 
$P\{A_i | a_i\}$ be the conditional probability of having transmitted the 
message $a_i$ and received any one of the messages in the set $A_i$, that is, the 
set of all possible words that differ from $a_i$ in not more than r digits. 
Now assume that we have found an encoding procedure for which 

$$P\{A_i | a_i\} = \sum_{k=0}^{r} \binom{n}{k} q^k p^{n-k} \ge 1 - \epsilon \qquad i = 1, 2, \ldots, N \qquad (4\text{-}92)$$

where $\epsilon$ is an arbitrarily small positive number and $A_i$ are mutually 
disjoint sets. Evidently, if N is not too large, it would be possible to encode 
the messages in binary words so that the above conditional probabilities 
are as large as desired. In fact, the conditional probability can be 
increased by increasing n, the length of the words, when N and $\epsilon$ are 
specified. The disjointness of the $A_i$ is not of much concern as long as 
N is quite small. For example, if only two messages $a_1$ and $a_2$ are to be 
transmitted, by assigning an adequately long sequence of 0's and 1's to 
$a_1$ and $a_2$, respectively, the requirements of Eq. (4-92) are met. But it is 

* The proof given here is a condensed version of a proof derived by D. D. Joshi. 
For further information see D. D. Joshi, L'information en statistique mathématique 
et dans la théorie des communications, Publ. inst. statistique univ. Paris, vol. 8, no. 2, 
pp. 95-99, 1959. 





not at all evident that such an encoding procedure would exist when N is 
comparatively large and $\epsilon$ arbitrarily small. To this end, we compute 
some bounds on $N_0$, the upper bound of N. 

Based on distance considerations, it becomes clear that the number of 
disjoint regions $A_i$ is less than the maximum number of disjoint spheres 
with radius r packed in an n cube. A direct estimate of the upper bound 
of this number can be made (Hamming's) by computing the ratio of the 
total number of points $2^n$ to the number of points in each sphere. That is, 

$$N_0 \le \frac{2^n}{\sum_{k=0}^{r} \binom{n}{k}} \qquad (4\text{-}93)$$

Let $\alpha = r/n$ and $\beta = 1 - \alpha$; then for sufficiently large n, say $n > n_0$, 
Eq. (4-93) gives 

$$\frac{2^n}{\frac{\beta}{\beta - \alpha} \binom{n}{n\alpha}} < N_0 < \frac{2^n}{\binom{n}{n\alpha}} \qquad n > n_0 \qquad (4\text{-}94)$$

Thus 

$$n - \log \frac{\beta}{\beta - \alpha} - \log \binom{n}{n\alpha} < \log N_0 < n - \log \binom{n}{n\alpha} \qquad (4\text{-}95)$$
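
The ratio of the total number of points to the number of points in each sphere is easy to evaluate numerically. The sketch below computes that ratio and checks, for one choice of n and r, that the number of points within distance r of a word lies between $\binom{n}{r}$ and $\frac{\beta}{\beta-\alpha}\binom{n}{r}$, which is the comparison underlying Eq. (4-94). Function names are assumptions, and $r < n/2$ is assumed throughout.

```python
from math import comb


def sphere_volume(n: int, r: int) -> int:
    """Number of n-digit binary words within distance r of a given word."""
    return sum(comb(n, k) for k in range(r + 1))


def hamming_bound(n: int, r: int) -> int:
    """Largest integer not exceeding 2**n divided by the sphere volume."""
    return 2 ** n // sphere_volume(n, r)


def volume_sandwich(n: int, r: int):
    """Lower estimate, exact volume, and upper estimate (requires r < n/2)."""
    alpha = r / n
    beta = 1.0 - alpha
    return comb(n, r), sphere_volume(n, r), comb(n, r) * beta / (beta - alpha)


print(hamming_bound(7, 1))       # at most 16 words of length 7 for r = 1
print(volume_sandwich(100, 10))  # the exact volume lies between the estimates
```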

For a moment, consider a hypothetical BSC with parameters 

$$q_0 = \alpha > q \qquad p_0 = \beta$$
$$C_0 = 1 + \alpha \log \alpha + \beta \log \beta < C$$

It will be shown that the nth-order extension of this channel will lead to 
an ideal transmission, that is, when n approaches infinity, $N_0$ and the 
detection error remain bounded by Eq. (4-95) and Eq. (4-92), 
respectively. Subsequently, it will be demonstrated that the fundamental 
theorem remains valid when the parameters of the hypothetical channel 
approach those of the specified channel (p,q,C). We employ an 
approximate form of Stirling's formula, that is, 

$$(2\pi)^{1/2} n^{n+1/2} e^{-n} < n! < (2\pi)^{1/2} (n + \tfrac{1}{2})^{n+1/2} e^{-(n+1/2)} \qquad (4\text{-}96)$$

or 

$$\log n! < \log \sqrt{\frac{2\pi}{e}} + (n + \tfrac{1}{2}) \log (n + \tfrac{1}{2}) - n \log e$$
$$\log n! > \log \sqrt{2\pi} + (n + \tfrac{1}{2}) \log n - n \log e \qquad (4\text{-}97)$$


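Proper application of these bounds yields 

$$\log \binom{n}{n\alpha} > \log \frac{e}{\sqrt{2\pi n \alpha \beta}} - n(\alpha \log \alpha + \beta \log \beta) - (n\alpha + \tfrac{1}{2}) \log \left(1 + \frac{1}{2n\alpha}\right) - (n\beta + \tfrac{1}{2}) \log \left(1 + \frac{1}{2n\beta}\right)$$
$$\log \binom{n}{n\alpha} < \log \frac{1}{\sqrt{2\pi e n \alpha \beta}} - n(\alpha \log \alpha + \beta \log \beta) + (n + \tfrac{1}{2}) \log \left(1 + \frac{1}{2n}\right) \qquad (4\text{-}98)$$

The completion of the proof of the theorem requires the introduction 
of the channel's capacity into our calculations. 

$$\log N_0 > nC_0 + \log \sqrt{2\pi e n \alpha \beta} - (n + \tfrac{1}{2}) \log \left(1 + \frac{1}{2n}\right) - \log \frac{\beta}{\beta - \alpha}$$
$$\log N_0 < nC_0 + \log \frac{\sqrt{2\pi n \alpha \beta}}{e} + (n\alpha + \tfrac{1}{2}) \log \left(1 + \frac{1}{2n\alpha}\right) + (n\beta + \tfrac{1}{2}) \log \left(1 + \frac{1}{2n\beta}\right) \qquad (4\text{-}99)$$

The entropy per transmitted symbol satisfies the inequalities 

$$\frac{1}{n} \log N_0 - C_0 > \frac{1}{2n} \log \frac{2\pi e n \alpha (\beta - \alpha)^2}{\beta} - \left(1 + \frac{1}{2n}\right) \log \left(1 + \frac{1}{2n}\right)$$
$$\frac{1}{n} \log N_0 - C_0 < \frac{1}{2n} \log \left(2\pi n \alpha \beta e^{-2}\right) + \left(\alpha + \frac{1}{2n}\right) \log \left(1 + \frac{1}{2n\alpha}\right) + \left(\beta + \frac{1}{2n}\right) \log \left(1 + \frac{1}{2n\beta}\right)$$

By applying the above detection scheme and greatly increasing the word 
length, one finds 

$$\lim_{n \to \infty} \left(\frac{1}{n} \log N_0 - C_0\right) = 0 \qquad (4\text{-}100)$$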


The upper bound of the number of words in the transmitter's vocabulary 
approaches $2^{nC_0}$ as n is made larger and larger. Meanwhile, the error in 
decoding each message $a_i$ is kept under control. If the message $a_i$ is 
transmitted with a probability $p_i$ for i = 1, 2, . . . , N, the average error 
remains bounded, that is, 

$$\text{Average error} \le \sum_{i=1}^{N} p_i \epsilon = \epsilon \qquad (4\text{-}101)$$




Finally, one may choose $\alpha$ arbitrarily close to q and follow the same 
reasoning to conclude that 

$$\lim_{n \to \infty} \frac{1}{n} \log N = C$$

This completes the proof of the fundamental theorem for discrete 
memoryless BSC. Similar proofs were also given earlier by G. A. Barnard, 
P. Elias, E. N. Gilbert, and D. Slepian. 

4-17. Encoding the English Alphabet. The redundancy of the English 
language has been estimated by Shannon and several other authors. 
The word "estimate" is used here, as the problem in itself is not 
mathematically well defined. In a rough estimate we may assume that the 
alphabet consists of 26 letters with mutually independent probabilities. 
The maximum entropy of such a system is log 26 = 4.70 bits per letter 
when all letters are equiprobable. Of course such an estimate is 
unrealistic. In an approximation, we may compute the desired entropy based on 
the frequency of letters as shown in Table 4-1. The corresponding 
entropy is found to be 4.3 bits per letter (D. A. Bell, p. 164). 

Table 4-1. Frequency of Letters in English Language 

Letter   Frequency, %   Code            Letter   Frequency, %   Code 
A         7.81          1111            N         7.28          1100 
B         1.28          101000          O         8.21          1110 
C         2.93          01010           P         2.15          110111 
D         4.11          11010           Q         0.14          1101100101 
E        13.05          100             R         6.64          1011 
F         2.88          01011           S         6.46          0110 
G         1.39          00001           T         9.02          001 
H         5.85          0001            U         2.77          01000 
I         6.77          0111            V         1.00          1101101 
J         0.23          1101100110      W         1.49          101001 
K         0.42          11011000        X         0.30          1101100111 
L         3.60          10101           Y         1.51          00000 
M         2.62          01001           Z         0.09          1101100100 
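
The first-order entropy estimate can be recomputed from the letter frequencies of Table 4-1, together with the average length of the listed code. The sketch below transcribes the table (frequencies in per cent) and treats the letters as independent, as the text does; the variable and function names are assumptions.

```python
from math import log2

# Letter: (frequency in per cent, code word), transcribed from Table 4-1.
TABLE_4_1 = {
    "A": (7.81, "1111"), "B": (1.28, "101000"), "C": (2.93, "01010"),
    "D": (4.11, "11010"), "E": (13.05, "100"), "F": (2.88, "01011"),
    "G": (1.39, "00001"), "H": (5.85, "0001"), "I": (6.77, "0111"),
    "J": (0.23, "1101100110"), "K": (0.42, "11011000"), "L": (3.60, "10101"),
    "M": (2.62, "01001"), "N": (7.28, "1100"), "O": (8.21, "1110"),
    "P": (2.15, "110111"), "Q": (0.14, "1101100101"), "R": (6.64, "1011"),
    "S": (6.46, "0110"), "T": (9.02, "001"), "U": (2.77, "01000"),
    "V": (1.00, "1101101"), "W": (1.49, "101001"), "X": (0.30, "1101100111"),
    "Y": (1.51, "00000"), "Z": (0.09, "1101100100"),
}


def entropy_bits(table) -> float:
    """First-order entropy, treating successive letters as independent."""
    return -sum((f / 100) * log2(f / 100) for f, _ in table.values())


def average_length(table) -> float:
    """Average number of binary digits per letter for the listed code."""
    return sum((f / 100) * len(code) for f, code in table.values())


def is_prefix_free(table) -> bool:
    """No code word may be the beginning of another code word."""
    codes = sorted(code for _, code in table.values())
    return all(not b.startswith(a) for a, b in zip(codes, codes[1:]))


print(round(entropy_bits(TABLE_4_1), 3), "bits per letter")
print(round(average_length(TABLE_4_1), 3), "binary digits per letter")
print(is_prefix_free(TABLE_4_1), log2(26))
```

The entropy obtained this way is necessarily below log 26 and cannot exceed the average code length.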


The main shortcoming of this calculation is that the successive letters 
are supposed to be transmitted independently of each other. This is, of 
course, not true, as the transmission of English letters in ordinary mean- 
ingful text is more of a stochastic nature of the Markov type. That is, 
the probability of the transmission of a letter is strongly affected by the 
probability of the transmission of the preceding letters. For example, 
the letter T is almost never followed by X but is very often followed by 
H. Therefore, in a better approximation we should compute the entropy 
based on the frequency of the occurrence of two successive letters. This 
'vill lead to the computation of the entropy of a discrete source with 26* 
symbols. Similarly, one could compute the freiiuency of the occurrence 
of any possible three-letter combinations and find the corresponding 



184 


DISCRETE schemes WITHOUT MEMORY 


entropy. Several authors have estimated the entropy of English text.
Their estimates indicate that the redundancy of the English language is
somewhere between 0.50 and 0.80. (It should be kept in mind that
redundancy is not always undesirable. For example, in the presence of
noise, redundancy contributes to improvement in the intelligibility of the
text.)
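
To make the redundancy figures concrete, the sketch below assumes the usual definition of redundancy as 1 - H/H_max, where H_max = log2(N) is the entropy of N equiprobable symbols. The function names are illustrative only. It converts the quoted redundancy range into the corresponding range of entropy per letter for a 26-letter alphabet.

```python
from math import log2


def redundancy(entropy_per_symbol, alphabet_size):
    """Redundancy 1 - H/H_max, with H_max = log2(alphabet_size) (assumed definition)."""
    return 1.0 - entropy_per_symbol / log2(alphabet_size)


def entropy_from_redundancy(r, alphabet_size):
    """Entropy per symbol implied by a given redundancy r."""
    return (1.0 - r) * log2(alphabet_size)


for r in (0.50, 0.80):
    h = entropy_from_redundancy(r, 26)
    print(f"redundancy {r:.2f} -> about {h:.2f} bits per letter")
```

Under this definition, a redundancy between 0.50 and 0.80 corresponds to an entropy between roughly 2.35 and 0.94 bits per letter, well below the equiprobable value of log2(26).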

Table 4-2. Codes for English Alphabet

Probability  Letter  Huffman code  Alphabetical code
0.1859       space   000           00
0.0642       A       0100          0100
0.0127       B       011111        010100
0.0218       C       11111         010101
0.0317       D       01011         01011
0.1031       E       101           0110
0.0208       F       001100        011100
0.0152       G       011101        011101
0.0467       H       1110          01111
0.0575       I       1000          1000
0.0008       J       0111001110    1001000
0.0049       K       01110010      1001001
0.0321       L       01010         100101
0.0198       M       001101        10011
0.0574       N       1001          1010
0.0632       O       0110          1011
0.0152       P       011110        110000
0.0008       Q       0111001101    110001
0.0484       R       1101          11001
0.0514       S       1100          1101
0.0796       T       0010          1110
0.0228       U       11110         111100
0.0083       V       0111000       111101
0.0175       W       001110        111110
0.0013       X       0111001100    1111110
0.0164       Y       001111        11111110
0.0005       Z       0111001111    11111111

Cost                 4.1195        4.1978


According to Shannon's* statistical study, printed English texts
(27 letters including a space) are approximately 75 per cent redundant.
If all letters were equiprobable, the same "information" could be encoded
in texts roughly one-fourth the size of the noncoded texts. This estimate
has been supported by Burton and Licklider.† Although interesting

* C. E. Shannon, Prediction and Entropy of Printed English, Bell System Tech. J.,
vol. 29, pp. 147-160, 1951.

† N. G. Burton and J. C. R. Licklider, Long-range Constraints in the Statistical
Structure of Printed English, Am. J. Psychol., vol. 68, pp. 650-653, 1955.





work on this subject has been done more recently at Harvard University,
it remains outside the scope of this study. The interested reader is
referred to the following article and references given there: George A.
Miller and Elizabeth A. Friedman, The Reconstruction of Mutilated
English Texts, Inform. and Control, vol. 1, pp. 38-55, 1957.

Historically speaking, the first significant encoding of a language
structure was the common Morse code. In this code, dot, dash, and space
are used. The dash occupies a time equal to the time of transmission of
three dots with no significant time space between them. The space
occupies a time equal to that of a dot. If these durations are taken as
cost units, then the average cost of the Morse-encoded English text is
found to be

    $\sum_{\text{alphabet}} P(x_i) L(x_i) = 6.0$ bits per letter

The Morse code is based on a compromise between two objectives
(Bell, p. 109): to assign the easiest symbols to the most frequent letters
and also to assign shorter codes to the more frequent letters. For
example, the letter E, which is the most frequently used letter in the
English language, is represented by a single dot. If the letters were
encoded strictly in accordance with the criterion of the shortest symbol
for the most frequent letter, the average cost per letter of English would
be reduced to 5.55 bits. The Shannon-Fano encoding procedure has
been used (Bell, p. 04) in deriving Table 4-1.

Huffman's optimum encoding procedure has been applied in a direct
manner to obtain the code given in Table 4-2. E. N. Gilbert and E. F.
Moore have derived other types of encoding for English texts. They
have obtained some interesting binary codes with the prefix property
(alphabetical codes). The cost of the best of these codes is close to the
optimum cost given by Huffman's method. An article by Gilbert and
Moore contains an alphabetical code with a cost of 4.1978 compared with
4.1195 of Huffman's code. According to these authors an alphabetical
encoding might be used as a means of saving memory space, in a
data-processing machine in general and in a language-translating machine in
particular, if it were desired to preserve the conventional alphabetical
order of dictionaries.
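
The optimum cost quoted for Table 4-2 can be examined with a short program. The sketch below is an illustration rather than the procedure used to prepare the table: it assumes the 27 probabilities of Table 4-2 as read from the printed table, and it builds Huffman code-word lengths with a standard priority-queue implementation (the helper names are hypothetical). Only the lengths are computed, since the cost depends on nothing else.

```python
import heapq

# Probabilities of Table 4-2 as read here (space plus 26 letters).
PROBS = {
    " ": 0.1859, "A": 0.0642, "B": 0.0127, "C": 0.0218, "D": 0.0317,
    "E": 0.1031, "F": 0.0208, "G": 0.0152, "H": 0.0467, "I": 0.0575,
    "J": 0.0008, "K": 0.0049, "L": 0.0321, "M": 0.0198, "N": 0.0574,
    "O": 0.0632, "P": 0.0152, "Q": 0.0008, "R": 0.0484, "S": 0.0514,
    "T": 0.0796, "U": 0.0228, "V": 0.0083, "W": 0.0175, "X": 0.0013,
    "Y": 0.0164, "Z": 0.0005,
}


def huffman_lengths(probs):
    """Return the code-word length of each symbol in a binary Huffman code."""
    # Heap entries: (probability, unique tie breaker, symbols under this node).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)
        p2, _, group2 = heapq.heappop(heap)
        # Merging two nodes adds one binary digit to every symbol below them.
        for s in group1 + group2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, group1 + group2))
        counter += 1
    return lengths


def cost(probs, lengths):
    """Average number of binary digits per symbol."""
    return sum(probs[s] * lengths[s] for s in probs)


huffman_cost = cost(PROBS, huffman_lengths(PROBS))
print(f"Huffman cost = {huffman_cost:.4f} binary digits per symbol")
```

The printed value can be compared with the costs listed in the last row of Table 4-2. Ties between equal probabilities may be broken differently from the table, so individual code words can differ while the average cost stays the same.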


PROBLEMS 

4-1. Verify if the following sets of word lengths may correspond to a uniquely
decipherable binary code.

(a) [W] = [0,2,3,2]

(b) [W] = [0,2,2,2,5]





4-2. In Fig. P4-2 each box represents a message output of an independent source.
The probability of the transmission of each message is known to be where k is a
parameter for each message as indicated in the figure.
(a) Devise a binary encoding for this message ensemble.

(b) Devise a binary encoding with the lowest average length.


Fig. P4-2


4-3. See whether it is possible to encode 105 messages in separable words with

(a) D =

(b) D = 4

4-4. A source without memory has six characters with the following associated
probabilities:

[A, B, C, D, E, F]
[1/3, 1/4, 1/8, 1/8, 1/12, 1/12]

(a) What is the entropy of this source?

(b) Devise an encoding procedure with the prefix property giving minimum possible
average length for the transmission over a binary noiseless channel. What is the
average length of the encoded messages?

4-5. Consider a BSC with P(1|0) = P(0|1) =    . The input to the channel
consists of four equiprobable words

w1   1 1 1

w2   1 0 0

w3   0 1 0

w4   0 0 1

(a) Compute P(1) and P(0) at the input.

(b) Compute the efficiency of the code.

(c) Compute the channel capacity.

4-6. (a) Apply Shannon's encoding procedure to the following set of messages:

[m1, m2, m3, m4]

[0.1, 0.2, 0.3, 0.4]

(b) Determine the efficiency of the code in each case.

(c) If the same technique is applied to the second-order extension of this source, 
how much will the efficiency be improved? 





4-7. Same questions as in Prob. 4-6 for the following set of messages:

[m1, m2, m3, m4, m5]

IH, }U, ■%! 

4-8. Apply the Gilbert-Moore techniques for encoding the messages listed in Prob.
4-7.

4-9. Answer all parts of Prob. 4-6 for the case where the Gilbert-Moore technique is 
employed. 

4-10. Given a discrete source with the following messages:

[m] = [m1, m2]
P[m] = [0.9, 0.1]

(a) Derive a Shannon code for the above messages.

(b) Find $\bar{L}$ and the code efficiency.

(c) Do parts (a) and (b) for the second-order extension of the source.

(d) Do parts (a) and (b) for the third-order extension of the source.

4-11. For the binary Huffman code, prove that

    $H(X) \le \bar{L} < H(X) + 1 - 2p_{\min}$

where $p_{\min}$ is the smallest probability in the message probability set.

4-12. Find the figure of merit of a Hamming's single-error correcting code for a
BSC with p = 0.01 in the following cases:

(a) Number of information digits is 4.

(b) Number of information digits is 11.

(c) Number of information digits is 26. 

4-13. Find the figure of merit of a Hamming’s double-error correcting code for a 
BSC. 

4-14. (a) Find an optimum binary encoding for the following messages:

[x1, x2, x3]

[h^ H, Ho] 

(b) Encode the output of the second-order extension of the source to the channel 
in an optimum binary code. 

(c) Determine the coding efficiency in (a) and (b).

(d) What is the smallest order of the extension of the channel if we desire to reach 
an efficiency of 1 — 10”^ and I — 10“^, respectively? 

4-15. A pulse-code communication channel has eight distinct amplitude levels
[x1, x2, . . . , x8]. The respective probabilities of these levels are [p1, p2, . . . , p8].
The messages are encoded in sequences of three binary pulses (that is, the third-order
extension of the source). The encoded messages are transmitted over a binary
channel (p,q).

(a) Compute H(X).

(b) Compute H(X|Y).

(c) Compute I(X;Y).

(d) Calculate (a), (b), and (c) for the numerical case, where


(1) Pi = Pi ^ Pi ^ H 
Pi ^ Pb He 
P7 = M 

p = 0.9 


(2) Pi = = P3 = H 

Pa ^ Pb = He 

P7 = H 

P8 “ Me 

p = 0.99 





4-16. We wish to transmit eight blocks of binary digits over a BEC. The first
three positions are used for the information and the rest for parity checks. The
following equations indicate the relations between information and parity digits.

    $\begin{bmatrix} x_4 \\ x_5 \\ x_6 \end{bmatrix} =
     \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}
     \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$

(a) Determine how many combinations of single and double erasures may be
corrected.

(b) Find the average erasure per block after correcting all possible single and
double errors.

(c) What is the average rate of information over the channel?

4-17. Apply Hamming's single-error correcting in the following cases: 


(o) 

m = 2 

/: = A 

ib) 

m = 4 

k = 3 

(c) 

vt = 4 

k = 4 

(d) 

in = f) 

k = 4 

(e) 

= 11 

k = 5 


4-18. Show that for a Huffman binary code

    $H \le \bar{L} < H + 1$

4-19. Prove that Huffman's encoding for a given alphabet has a cost which is less
than or equal to that of any uniquely decipherable encoding for that alphabet (see
Gilbert-Moore, theorem 11).



PART 2 


CONTINUUM WITHOUT MEMORY 


The science of physics does not only give us [mathematicians] an opportunity
to solve problems, but helps us also to discover the means of solving
them, and it does this in two ways: it leads us to anticipate the solution and
suggests suitable lines of argument.

Henri Poincaré
La valeur de la science




CHAPTER 5 


CONTINUOUS PROBABILITY DISTRIBUTION AND DENSITY 


5-1. Continuous Sample Space. In Sec. 2-15 we presented the concept
of a discrete sample space and its associated discrete random variable.
In this section we should like to introduce the idea of a random variable
assuming a continuum of values.

Consider, for instance, X to be a random noise voltage which can
assume any value between zero and 1 volt. Since by assumption the
outcomes of this experiment are points on the real line interval [0,1],
clearly X assumes a continuum of values. Furthermore, if we make this
assumption, then we may state that X is a random variable taking a
continuum of values.

The preceding intuitive approach in defining a random variable is
unavoidable in an introductory treatment of the subject. On the other
hand, a mathematically rigorous treatment of this more or less familiar
concept requires extensive preparation in the professional field of measure
theory. Such a presentation is beyond the scope of this book; for a
complete coverage see Halmos and Loève. For the time being, the reader
may satisfy himself with the following.

As in the case of a discrete sample space, an event is interpreted as a
subset of a continuous sample space. In the former case we have already
given methods for calculating probabilities of events. For the continuous
case, however, it is not possible to give a probability measure satisfying
all four requirements of Eqs. (2-36) to (2-39) such that every subset
has a probability. The proof of this statement is involved with a number
of mathematical complexities among which is the so-called "continuum
hypothesis." Because of these difficulties in the study of continuous
sample space, one has to confine oneself to a family of subsets of the
sample space which does not contain all the subsets but which has enough
subsets so that set algebra can be worked out within the members of that
family (for example, union and intersection of subsets, etc.). Such a
family of subsets of the sample space $\Omega$ will be denoted by $\mathcal{F}$
($\mathcal{F}$ stands for the mathematical term field). More specifically, the
events of $\mathcal{F}$ must satisfy the following two requirements:



1. If $A_1, A_2, \ldots \in \mathcal{F}$, then

    $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$    (5-1)

2. If $A \in \mathcal{F}$, then

    $U - A \in \mathcal{F}$    (5-2)

The first property simply implies that the union of a denumerable
sequence of events must also be an event. The second property requires
that the complement of an event also be an event.

With such a family in mind, the next step will be to define a probability
measure $P\{A\}$ for every event A of that family. This can be done in a
way similar to the definition of a probability measure over the discrete
sample space, namely,

1. For each $A \in \mathcal{F}$,

    $0 \le P\{A\}$    (5-3)

2. For all denumerable unions of disjoint events of the $\mathcal{F}$ family,

    $P\left\{\bigcup_{i=1}^{\infty} A_i\right\} = \sum_{i=1}^{\infty} P\{A_i\}$    (5-4)

3.  $P\{U\} = 1$    (5-5)

We assume the validity of these axioms and then proceed with defining
the probability distribution and density of a continuous random variable.

It is to be noted that, in the strict sense, a random variable need not
be real-valued. One can directly define a complex-valued random variable
through two real-valued variables:

    $X + \sqrt{-1}\,Y$

This simply requires the measure space to be a complex two-dimensional
space rather than an ordinary real space.

5-2. Probability Distribution Functions. For simplicity, consider first
the case of a random variable taking values in a one-dimensional real
coordinate space. The probability that the random variable X assumes
values such that

    $E = \{a < X \le b\}$    (5-6)

is shown by $P\{a < X \le b\} = P\{E\}$.

The event E in Eq. (5-6) consists of the set of all subevents such that their
corresponding values of X satisfy the above inequality. In particular,
consider the event $E_1$ defined by

    $E_1 = \{X \le a\}$    $-\infty < a < +\infty$

Note that

    $E_1 E = \varnothing$

    $E \cup E_1 = \{X \le b\}$

Therefore, according to Eq. (2-69),

    $P\{E_1 \cup E\} = P\{E_1\} + P\{E\}$

That is,    $P\{E\} = P\{X \le b\} - P\{X \le a\}$    (5-7)

In general, if x is any real number, we may write

    $P\{X \le x\} = F(x)$    (5-8)

F(x) is called the probability distribution function, or cumulative distribution
function (CDF), of the random variable X. Note that the distribution
function is defined for all real values of x. It is a monotonic
nondecreasing function of x continuous on the right for every x. The
following two properties of the distribution function are evident in the
light of Eqs. (2-38) and (2-39):

    $\lim_{x \to -\infty} F(x) = 0$
    $\lim_{x \to +\infty} F(x) = 1$    (5-9)

Any monotonic nondecreasing function continuous on the right for every x
satisfying Eq. (5-9) can be regarded as a distribution function.

There are two important classes of CDF. Although they do not cover
all possible cases of CDF, they are the most important ones: (1) discrete
and (2) continuous. A random variable and its CDF are said to be
discrete if the only values that the variable can assume with positive
probability are at most denumerable. In other words, if there exists a
denumerable sequence of distinct numbers $a_j$ (j = 1, 2, . . .) such that
$\sum_j P\{X = a_j\} = 1$, then the CDF is defined as

    $F(x) = \sum_{a_j \le x} P\{X = a_j\}$    $a_1 < a_2 < a_3 < \cdots$    (5-10)

The binomial and Poisson's distributions discussed in Secs. 2-18 and 2-19
are the most common examples of the discrete case.

When the random variable and its CDF admit a continuum of values,
they are said to be of the continuous type. The normal distribution of
Sec. 5-4 is the most common example of the continuous case.

Example 5-1. Suppose that in Example 2-28 the random variable X takes on
any one of the values 1, 2, 3, 4, 5, 6 with equal probability 1/6. Then for
$0 \le x \le 6$

    $F(x) = P\{X \le x\} = \sum_{i=1}^{[x]} P\{X = i\} = \frac{[x]}{6}$

where [x] denotes the greatest integer smaller than or equal to a chosen x. Figure
E5-1 shows the graph of the discontinuous function F(x). Note that the probability
distribution function rises by jumps in the case of discrete variables.
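
As an illustration of Example 5-1, the staircase CDF of a fair die can be written in a few lines. This is a minimal sketch; the function name `die_cdf` is chosen here for illustration, and exact fractions are used so that the jumps of size 1/6 are visible without rounding.

```python
from fractions import Fraction
from math import floor


def die_cdf(x):
    """F(x) = P{X <= x} for a fair die with faces 1, ..., 6."""
    if x < 1:
        return Fraction(0)
    if x >= 6:
        return Fraction(1)
    # Between 1 and 6 the CDF equals [x]/6, with [x] the integer part of x.
    return Fraction(floor(x), 6)


for x in (0.5, 1, 2.7, 3, 5.99, 6, 10):
    print(x, die_cdf(x))
```

The function is constant between integers and jumps at each face value, and it takes the upper value at the jump, in agreement with continuity on the right.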



Fig. E5-1

Example 5-2. Assume that we have a circular disk with a circumference of unit
length. A completely symmetrical pointer at the center of the disk is whirled. The
pointer stops at a point on the periphery. Let X be the reading of the pointer from
the point zero. If perfect symmetry is assumed, the probability of X being in any
interval is proportional to the length of that interval, i.e.,

    $P\{a < X \le b\} = K(b - a)$    $0 \le a \le b \le 1$

K being a constant of proportionality. Accordingly,

    $P\{-\infty < X < 0\} = 0$
    $P\{1 < X < \infty\} = 0$
    $P\{0 < X \le 1\} = K$

Hence, K = 1 and

    $F(x) = P\{0 < X \le x\} = x$    $0 \le x \le 1$

5-3. Probability Density Function. If F(x) is such that

    $F(x) = \int_{-\infty}^{x} f(t)\,dt$    (5-11)

where f(t) is a real-valued integrable function, then F(x) is said to be
absolutely continuous. It is known that almost everywhere

    $\frac{dF(x)}{dx} = f(x)$    (5-11a)

In case F(x) is an absolutely continuous CDF, we have

    $P\{a < X \le b\} = \int_a^b f(x)\,dx$    (5-12)

f(x) is known as the probability density function (PDF).

Since the probability distribution function is a nondecreasing monotonic
function, the density function will be nonnegative over the real axis:

    $f(x) \ge 0$    (5-13)

Furthermore,

    $\int_{-\infty}^{\infty} f(x)\,dx = F(+\infty) - F(-\infty) = 1$    (5-14)

If F(x) is a continuous function about x = a, the probability of X assuming
the value x = a is zero. In fact,

    $P\{X = a\} = \lim_{\epsilon \to 0} P\{a - \epsilon < X \le a\}$    (5-15)

    $P\{X = a\} = \lim_{\epsilon \to 0} [F(a) - F(a - \epsilon)] = 0$    (5-16)

For continuous random variables the probability of the random variable
being in an interval decreases with the length of that interval and in
the limit becomes zero.

For a discrete random variable, if X = a is a possible value for the
random variable, then

    $P\{X = a\} = \lim_{\epsilon \to 0} [F(a) - F(a - \epsilon)] \ne 0$    (5-17)*

The mathematical implication of this equation is rather clear. However,
the engineering-minded reader may find it convenient for his own
use to illustrate the density distribution function in the discrete case
with the help of Dirac or impulse functions.† A unit impulse effective
at a point x = a will be denoted by $u_0(x - a)$.

The discrete probability distribution function of Fig. 5-1 leads to the
density distribution function of Fig. 5-2.

Fig. 5-1. Example of a discrete CDF.

Fig. 5-2. Example of the discrete distribution corresponding to Fig. 5-1 in
terms of impulse functions. $P\{X = a\} = \alpha$, $P\{X = b\} = \beta$,
$P\{X = c\} = \gamma$, $\alpha + \beta + \gamma = 1$.

* When dealing with continuous probabilities due to Eq. (5-17), we understand that
the expressions $P\{X < a\}$ and $P\{X \le a\}$ are equivalent.

† While the rigorous use of Dirac "functions" requires special mathematical
consideration, their employment is frequent and commonplace in electrical engineering
literature. In this respect, Fig. 5-2 may be of interest to the electrical engineer.




Example 5-3. A random process gives measurements x between 0 and 1 with a
probability density function

    $f(x) = 12x^3 - 21x^2 + 10x$    $0 \le x \le 1$
    $f(x) = 0$    elsewhere

(a) Find $P\{X \le 1/2\}$ and $P\{X > 1/2\}$.

(b) Find a number K such that $P\{X \le K\} = 1/2$.

Solution

(a)    $P\{X \le 1/2\} = \int_0^{1/2} (12x^3 - 21x^2 + 10x)\,dx = 9/16$

       $P\{X > 1/2\} = 1 - 9/16 = 7/16$

(b)    $\int_0^K (12x^3 - 21x^2 + 10x)\,dx = 1/2$

       $3K^4 - 7K^3 + 5K^2 = 1/2$

The permissible answer is the root of this equation between 0 and 1; this is found to be

    K = 0.452
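
The root in Example 5-3 can be located numerically. The sketch below assumes that the target probability in part (b) is 1/2, which is consistent with the quoted root K = 0.452; the function names and the bisection tolerance are illustrative choices. It uses the closed-form CDF $3x^4 - 7x^3 + 5x^2$ obtained by integrating the density and solves F(K) = 1/2 by bisection.

```python
def cdf(x):
    """CDF of the density 12x^3 - 21x^2 + 10x on [0, 1]."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    return 3 * x**4 - 7 * x**3 + 5 * x**2


def solve_cdf(target, lo=0.0, hi=1.0, tol=1e-10):
    """Find x in [lo, hi] with cdf(x) = target by bisection (cdf is nondecreasing)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


print("P{X <= 1/2} =", cdf(0.5))
print("P{X >  1/2} =", 1 - cdf(0.5))
print("K =", round(solve_cdf(0.5), 3))
```

Bisection is sufficient here because the CDF is monotonic on [0, 1], so the equation has exactly one root in that interval.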

5-4. Normal Distribution. A random variable with a cumulative
distribution function given by

    $F(x) = P\{X \le x\} = \int_{-\infty}^{x} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(t - a)^2}{2\sigma^2}\right] dt$    (5-18)

is called a variable with normal or gaussian distribution. The corresponding
density function is

    $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x - a)^2}{2\sigma^2}\right]$    (5-19)

which is symmetrical about x = a.

The numbers a and $\sigma$ are called the average and the standard deviation
of the random variable, respectively; their significance is discussed in
Chap. 6.

One may be interested in checking the suitability of the function f(x)
of Eq. (5-19) as a density function. In other words, one has to show that

    $\int_{-\infty}^{\infty} f(x)\,dx = 1$

For this purpose, we may shift the density curve to the left by a units.
Next, we consider the double integral

    $\left[\int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} e^{-x^2/2\sigma^2}\,dx\right]^2
      = \frac{1}{2\pi\sigma^2} \int_{-\infty}^{\infty} e^{-x^2/2\sigma^2}\,dx \int_{-\infty}^{\infty} e^{-y^2/2\sigma^2}\,dy$
    $ = \frac{1}{2\pi\sigma^2} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2 + y^2)/2\sigma^2}\,dx\,dy$    (5-20)

Finally, introducing the familiar polar coordinates yields

    $\left[\int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} e^{-x^2/2\sigma^2}\,dx\right]^2
      = \frac{1}{2\pi\sigma^2} \int_0^{2\pi}\int_0^{\infty} e^{-r^2/2\sigma^2}\, r\,dr\,d\theta = 1$    (5-21)

In the next chapter, it will be shown that the letter a in Eq. (5-18)
denotes the "average" value of the random variable with a normal
distribution.

When the parameters of the normal distribution have values

    a = 0    and    $\sigma = 1$

the distribution function is called standard normal distribution.

    $F(x) = P\{X \le x\} = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\,dt$    (5-22)

Tables of standard normal distribution are commonly available. To
use such tables one first employs a transformation of variable of the type

    $y = \frac{x - a}{\sigma}$

in Eq. (5-19) in order to transform the normal curve into a standard
normal curve. Standard normal distributions are given in Table T-2 of
the Appendix. The use of this table for evaluating the probability of a
random variable being in an interval is self-explanatory. All one has to
remember is that Eq. (5-12) suggests the equivalence between probability
and the area under the density curve between points of interest.
For example, if X has a standard normal probability density distribution,
then

    $P\{0 < X < 2\} = 0.47725$
    $P\{-2 < X < 2\} = 0.95450$
    $P\{(X < -2) \cup (X > 2)\} = 1 - 0.95450 = 0.04550$
    $P\{X < 2\} = 0.97725$
    $P\{X > 2\} = 1 - 0.97725 = 0.02275$

More detailed information is given in Table T-3 of the Appendix.
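
Where a printed table is not at hand, the standard normal CDF can be evaluated from the error function. The sketch below relies on the identity $F(x) = \frac{1}{2}[1 + \mathrm{erf}(x/\sqrt{2})]$ and on `math.erf` from the Python standard library; the function name `std_normal_cdf` is an illustrative choice. It recomputes the probabilities listed above.

```python
from math import erf, sqrt


def std_normal_cdf(x):
    """Standard normal CDF, F(x) = 0.5 * (1 + erf(x / sqrt(2)))."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


p_0_to_2 = std_normal_cdf(2) - std_normal_cdf(0)
p_within_2 = std_normal_cdf(2) - std_normal_cdf(-2)
p_below_2 = std_normal_cdf(2)

print(f"P(0 < X < 2)  = {p_0_to_2:.5f}")
print(f"P(-2 < X < 2) = {p_within_2:.5f}")
print(f"P(|X| > 2)    = {1 - p_within_2:.5f}")
print(f"P(X < 2)      = {p_below_2:.5f}")
print(f"P(X > 2)      = {1 - p_below_2:.5f}")
```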

Example 5-4. The average life of a certain type of electric bulb is 1,200 hours.
What percentage of this type of bulb is expected to fail in the first 800 working hours?
What percentage is expected to fail between 800 and 1,000 hours? Assume a normal
distribution with $\sigma = 200$ hours.

Solution. Referring to Sec. 2-8, one notes that in a large number of samples the
frequency of the failures is approximately equal to the probability of failure. In this
connection the word percentage is used synonymously with frequency. Using the
average life of 1,200 hours for a, we make a change of variable $y = (x - a)/\sigma$ which
allows the normal curve in y to be symmetrical about y = 0, hence permitting the use
of a table of normal probability.

    $y_0 = \frac{x - a}{\sigma} = \frac{800 - 1{,}200}{200} = -2$

    $\int_{800}^{1,200} \frac{1}{200\sqrt{2\pi}} \exp\left[-\frac{(x - 1{,}200)^2}{80{,}000}\right] dx
      = \int_0^2 \frac{1}{\sqrt{2\pi}} e^{-y^2/2}\,dy = 0.477$

The area under the whole normal curve being unity, the desired probability is

    0.500 - 0.477 = 0.023

For the second part of the problem, let

    $y_1 = \frac{1{,}000 - 1{,}200}{200} = -1$

    $\int_0^1 f(y)\,dy = 0.341$

    0.500 - 0.341 = 0.159
    0.159 - 0.023 = 0.136

The reader should note that, in view of our assumption of normal distribution, there
is a fraction of bulbs with negative life expectancy ($-\infty$ to zero). This fraction
is included in the above calculation; that is, the number
$0.023 = P\{-\infty < X < 800\}$ will be somewhat larger than $P\{0 < X < 800\}$.

Of course, if we had used a density distribution bounded between 0 and
infinity, we should not be confronted with the problem of negative life expectancy.
On the other hand, tables of such distributions are not readily available. The
calculation of $P\{800 < X < 1{,}200\}$ in lieu of $P\{0 < X < 800\}$ was a simple matter
of using Table T-2 of normal distributions.

Fig. E5-4
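
The numbers in Example 5-4 can be reproduced without a table. This is a minimal sketch using `math.erf` from the standard library; the helper `normal_cdf` and the variable names are illustrative. It also evaluates the probability mass assigned to negative lifetimes, which the remark above points out is included in the tail below 800 hours.

```python
from math import erf, sqrt


def normal_cdf(x, mean, sigma):
    """CDF of a normal variable, using the change of variable y = (x - mean) / sigma."""
    y = (x - mean) / sigma
    return 0.5 * (1.0 + erf(y / sqrt(2.0)))


MEAN_LIFE = 1200.0  # hours, from the example
SIGMA = 200.0       # hours, from the example

p_fail_800 = normal_cdf(800, MEAN_LIFE, SIGMA)
p_fail_800_1000 = normal_cdf(1000, MEAN_LIFE, SIGMA) - p_fail_800
p_negative_life = normal_cdf(0, MEAN_LIFE, SIGMA)

print(f"fail before 800 h:          {p_fail_800:.3f}")
print(f"fail between 800 and 1000 h: {p_fail_800_1000:.3f}")
print(f"mass on negative lifetimes:  {p_negative_life:.2e}")
```

With a mean six standard deviations above zero, the mass on negative lifetimes is negligible compared with the tabulated accuracy, which is why the normal model is acceptable here.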


5-5. Cauchy's Distribution. A random variable X is said to have a
Cauchy distribution if

    $F(x) = \frac{1}{\pi}\left(\tan^{-1} x + \frac{\pi}{2}\right)$

The corresponding density function is

    $f(x) = \frac{1}{\pi(1 + x^2)}$    (5-24)

Note that

    $\int_{-\infty}^{\infty} f(x)\,dx = \frac{1}{\pi}\left[\tan^{-1} x\right]_{-\infty}^{\infty} = 1$

The graphs of the corresponding density and CDF are shown in Figs.
5-3 and 5-4, respectively.

Fig. 5-3. A normal PDF.    Fig. 5-4. A normal CDF.

Example 5-5. Consider a point M on the vertical axis of a two-dimensional
rectangular system with OM = 1. A straight line MN is drawn at a random angle $\theta$
(Fig. E5-5). What is the probability distribution of the random variable ON = X?

Fig. E5-5

Solution. The problem suggests that the angle $\theta$ in Fig. E5-5 has a uniform
probability distribution. Thus the probability of drawing a line in a particular
$d\theta$ is $d\theta/\pi$. The random variable of interest is $ON = X = \tan\theta$. Accordingly,

    $P\{x - dx < X \le x\} = P\{\theta - d\theta < \text{angle} \le \theta\} = \frac{d\theta}{\pi} = \frac{dx}{\pi(1 + x^2)}$

    $f(x) = \frac{1}{\pi(1 + x^2)}$

    $F(x) = \frac{1}{\pi}\left[\tan^{-1} t\right]_{-\infty}^{x} = \frac{1}{\pi}\left(\tan^{-1} x + \frac{\pi}{2}\right)$
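
The result of Example 5-5 can be checked by simulation. The sketch below assumes that the angle is measured from the vertical axis OM and is uniform on $(-\pi/2, \pi/2)$, so that $X = \tan\theta$; the sample size, the seed, and the function names are arbitrary illustrative choices. It compares the empirical distribution of X with the Cauchy CDF.

```python
import math
import random


def sample_on(rng):
    """Draw one value of X = ON = tan(theta), theta uniform on (-pi/2, pi/2)."""
    theta = rng.uniform(-math.pi / 2, math.pi / 2)
    return math.tan(theta)


def cauchy_cdf(x):
    """F(x) = (arctan(x) + pi/2) / pi."""
    return (math.atan(x) + math.pi / 2) / math.pi


def empirical_cdf(samples, x):
    """Fraction of samples not exceeding x."""
    return sum(1 for s in samples if s <= x) / len(samples)


rng = random.Random(0)  # fixed seed so that the run is repeatable
samples = [sample_on(rng) for _ in range(200_000)]

for x in (-1.0, 0.0, 1.0, 5.0):
    print(f"x = {x:5.1f}  empirical = {empirical_cdf(samples, x):.4f}  "
          f"Cauchy CDF = {cauchy_cdf(x):.4f}")
```

The two columns should agree to within the sampling error, which for this sample size is on the order of 0.001.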

5-6. Exponential Distribution. A probability density distribution of
the type

    $f(x) = a e^{-ax}$    $a > 0$    $x > 0$
    $f(x) = 0$    elsewhere    (5-25)

is referred to as an exponential distribution. The corresponding CDF is
given by

    $P\{X \le x\} = \int_0^x a e^{-at}\,dt = \left[-e^{-at}\right]_0^x$

    $F(x) = 1 - e^{-ax}$    (5-26)

A graph of the exponential density and its CDF are given in Fig. 5-5a
and b, respectively.

Fig. 5-5. (a) Example of an exponential PDF. (b) CDF of the density illustrated
in Fig. 5-5a.

5-7. Multidimensional Random Variables. The coordinate space can
be a multidimensional space. In this case the random variable X
assumes values of the type $(x_1, x_2, \ldots, x_n)$, that is, n-tuples of real
numbers. For example, if four dice of different colors are thrown
simultaneously, the random variable associated with the outcome takes certain
number quadruples as values. In fact, we are considering a sample space
that is the cartesian product of a finite number of other sample spaces.
If the outcome $E_k$ is in the sample space $\Omega_k$, the n-fold outcome E is
defined as a point in the cartesian product space $\Omega$, that is,

    $E_k \in \Omega_k$    k = 1, 2, . . . , n    (5-27)

    $E = \{E_1, E_2, \ldots, E_n\}$

    $\Omega = \Omega_1 \otimes \Omega_2 \otimes \cdots \otimes \Omega_n$

Then    $E \in \Omega$

If the outcome E is a permissible point of the product sample space $\Omega$
and if the events $E_k$ are mutually independent (this is not always the
case), then the probability measure associated with E equals the product
of the individual probability measures, i.e.,

    $m(E) = m(E_1)\,m(E_2) \cdots m(E_n)$    (5-28)

It is to be noted that by the event $E_k \in \Omega_k$ we understand the set of all
events in the $\Omega$ space where the kth random variable assumes a specified
value but variables other than that are arbitrary. Such a set of events
is usually called a cylinder set. The probability measure associated with
this cylinder is defined as the probability of the event $E_k$.

By analogy with the one-dimensional case, we define the cumulative
probability distribution function (CDF) of the n-dimensional random
variable $(X_1, X_2, \ldots, X_n)$ as

    $F(x_1, x_2, \ldots, x_n) = P\{-\infty < X_1 \le x_1,\ -\infty < X_2 \le x_2,\ \ldots,\ -\infty < X_n \le x_n\}$
    $ = P\{X_1 \le x_1,\ X_2 \le x_2,\ \ldots,\ X_n \le x_n\}$    (5-29)

Now we explore this defining equation for the two most important cases of
continuous and discrete variables.

Continuous Case. The CDF is also defined by

    $F(x_1, x_2, \ldots, x_n) = \int_{-\infty}^{x_1}\int_{-\infty}^{x_2} \cdots \int_{-\infty}^{x_n}
      f(t_1, t_2, \ldots, t_n)\,dt_1\,dt_2 \cdots dt_n$    (5-30)

where $f(x_1, x_2, \ldots, x_n)$ is the probability density function. Note that

    $\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, x_2, \ldots, x_n)\,dx_1\,dx_2 \cdots dx_n = 1$    (5-31)

The study of n different random variables $X_1, X_2, \ldots, X_n$ is equivalent
to the consideration of one n-dimensional random variable

    $X = (X_1, X_2, \ldots, X_n)$

The one-dimensional variables $X_1, X_2, \ldots, X_n$ are said to be independent
if for all permissible values of the variables and all joint CDF's
we have

    $F(x_1, x_2, \ldots, x_n) = F_1(x_1)\,F_2(x_2) \cdots F_n(x_n)$    (5-32)

where $F_i(x_i)$ denotes the cumulative distribution function of the
one-dimensional random variable $X_i$, that is,

    $F_i(x_i) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{x_i} \cdots \int_{-\infty}^{\infty}
      f(x_1, \ldots, x_i, \ldots, x_n)\,dx_1\,dx_2 \cdots dx_n$
    $ = P\{-\infty < X_1 < \infty,\ \ldots,\ -\infty < X_i \le x_i,\ \ldots,\ -\infty < X_n < \infty\}$    (5-33)

$F_i(x_i)$ is called the marginal probability distribution function of $X_i$, in other
words, the cumulative distribution of $X_i$ irrespective of the values
assumed by other variables.

When the density functions (continuous type) exist, the condition of
independence [Eq. (5-32)] can be written in the equivalent form:

    $f(x_1, x_2, \ldots, x_n) = f_1(x_1)\,f_2(x_2) \cdots f_n(x_n)$    (5-34)

where $f_i(x_i)$ is the density function of the random variable $X_i$, without
regard to other variables. This is the so-called "marginal density function"
of the variable $X_i$. The condition must be satisfied for all permissible
values of the variables.

Discrete Case. In this case the random variable

    $X = (X_1, X_2, \ldots, X_n)$

takes on only a denumerable number of n-tuples as values such that the
total probability is concentrated in only a denumerable number of points
of the n-dimensional space. It is obvious that in this case each of the
component random variables $X_i$ can take only denumerably many
values and that the marginal distribution of $X_i$,

    $F_i(x_i) = P\{X_i \le x_i\}$    (5-35)

is also discrete. The definition of independence can now be given as
before in terms of the marginal distribution.

As is easy to see, it is much more convenient in the discrete case to
specify probabilities of the type $P\{X_1 = a_1, \ldots, X_n = a_n\}$ rather
than to give the analytical form of the CDF.

5-8. Joint Distribution of Two Variables: Marginal Distribution

Continuous Case. The case of a two-dimensional random variable
(X,Y) is of considerable interest and will be treated in some detail. Let
f(x,y) be the corresponding probability density function; then by a reasoning
similar to that of Sec. 5-2 we find that

    $P\{a_1 < X \le b_1,\ a_2 < Y \le b_2\} = \int_{a_1}^{b_1}\int_{a_2}^{b_2} f(x,y)\,dx\,dy$

    $P\{-\infty < X \le x,\ -\infty < Y \le y\} = F(x,y)
      = \int_{-\infty}^{x}\int_{-\infty}^{y} f(x,y)\,dx\,dy$    (5-36)

with the understanding that    $f(x,y) \ge 0$

and    $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\,dx\,dy = 1$

Differentiating Eq. (5-36) partially first with respect to x and then
with respect to y gives

    $\frac{\partial F(x,y)}{\partial x} = \int_{-\infty}^{y} f(x,t)\,dt$

    $\frac{\partial^2 F(x,y)}{\partial x\,\partial y} = f(x,y)$    (5-37)

The latter equation is in a sense a "dual" expression for Eq. (5-36).

The probability of the variable (X,Y) assuming values in the rectangle
$\{a_1 < X \le b_1,\ a_2 < Y \le b_2\}$ can be directly computed. Consider the
following sets of events in relation to Fig. 5-6.

    $E_1 = \{X \le b_1,\ Y \le b_2\}$    $E_2 = \{X \le a_1,\ Y \le a_2\}$
    $E_3 = \{X \le b_1,\ Y \le a_2\}$    $E_4 = \{X \le a_1,\ Y \le b_2\}$    (5-38)

The event of interest can be written as

    $E = \{a_1 < X \le b_1,\ a_2 < Y \le b_2\}$    (5-39)

Fig. 5-6. Certain events associated with the sample space of a random variable.

Now keeping in mind that the probability $P\{E\}$ would correspond to the
double integral of the density over the above region, from Fig. 5-6 one
concludes that

    $P\{E\} = P\{a_1 < X \le b_1,\ a_2 < Y \le b_2\}
      = F(b_1,b_2) - F(b_1,a_2) + F(a_1,a_2) - F(a_1,b_2)$    (5-40)
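
Equation (5-40) lends itself to a direct numerical check. The sketch below uses an illustrative joint CDF that is not taken from the text: two independent exponential variables with unit parameter, for which $F(x,y) = (1 - e^{-x})(1 - e^{-y})$ when x, y > 0. The rectangle probability obtained from the four corner values of F is compared with the product of the two one-dimensional interval probabilities.

```python
from math import exp


def joint_cdf(x, y):
    """Assumed example: independent unit-rate exponentials, F(x, y) = (1 - e^-x)(1 - e^-y)."""
    if x <= 0 or y <= 0:
        return 0.0
    return (1.0 - exp(-x)) * (1.0 - exp(-y))


def rectangle_probability(F, a1, b1, a2, b2):
    """P{a1 < X <= b1, a2 < Y <= b2} from the corner values of the joint CDF, Eq. (5-40)."""
    return F(b1, b2) - F(b1, a2) + F(a1, a2) - F(a1, b2)


a1, b1, a2, b2 = 0.5, 2.0, 0.0, 4.0
from_corners = rectangle_probability(joint_cdf, a1, b1, a2, b2)
# For this particular CDF the same probability factors into two interval probabilities.
direct = (exp(-a1) - exp(-b1)) * (exp(-a2) - exp(-b2))

print(f"from Eq. (5-40): {from_corners:.6f}")
print(f"direct product:  {direct:.6f}")
```

The function `rectangle_probability` accepts any joint CDF, so the same four-corner evaluation applies whether or not the components are independent.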

The probability $P\{X \le x\}$, irrespective of the values assumed by the
second component Y, can be written as

    $F_1(x) = P\{X \le x\} = \int_{-\infty}^{x} dx \int_{-\infty}^{\infty} f(x,y)\,dy$    (5-41)

$F_1(x)$ is defined as the marginal distribution of the variable X. The
marginal density function can be obtained in the following manner:

    $f_1(x) = \frac{dF_1(x)}{dx} = \int_{-\infty}^{\infty} f(x,y)\,dy$    (5-42)

The marginal probability distribution and density function for the variable
Y can be given in a similar manner:

    $F_2(y) = P\{Y \le y\} = \int_{-\infty}^{y} dy \int_{-\infty}^{\infty} f(x,y)\,dx$    (5-43)

    $f_2(y) = \frac{dF_2(y)}{dy} = \int_{-\infty}^{\infty} f(x,y)\,dx$    (5-44)

Note that the marginal distributions can be alternatively written as

    $F_1(x) = F(x,\infty)$
    $F_2(y) = F(\infty,y)$    (5-45)



When the two variables X and Y are independent, then, for every pair
of $(a_1,b_1)$ and $(a_2,b_2)$, we have

    $P\{a_1 < X \le b_1,\ a_2 < Y \le b_2\} = P\{a_1 < X \le b_1\} \cdot P\{a_2 < Y \le b_2\}$    (5-46)

    $F(x,y) = F_1(x) \cdot F_2(y)$    (5-47)

Note that for continuous distributions when the two variables are
mutually independent for all possible pairs of (x,y)

    $f(x,y) = f_1(x) \cdot f_2(y)$    (5-48)

Discrete Case. Let the random variable (X,Y) take the values

    $(x_j, y_k)$    j = 1, 2, . . .
                    k = 1, 2, . . .

and let $P\{X = x_j,\ Y = y_k\} = P\{j,k\}$. Then the CDF of the random
variable is

    $F(x,y) = \sum_{x_j \le x} \sum_{y_k \le y} P\{j,k\}$    (5-49)

and one can also calculate the marginal probabilities as before. Note
that if one or both of j and k are finite in number, the calculations become
much easier.


Example 5-6. The density function of a two-dimensional continuous distribution
is given as

J{x,y) = for ^ 0, ?/ > 0 

f(x,y) = 0 elsewhere 


Find the probability of (3^ < X < 2; 0 < F < 4). 

Solution. First we observe whether the given density function is a permissible one.
In fact, f(x,y) is nonnegative in the specified range and

/o" fo° ^ /o* (lx = 1 

The desired answer is given by 

fy Jo dx dy — 


5-9. Conditional Probability Distribution and Density

Continuous Case. The conditional probability distribution for two- 
dimensional random variables can be derived in a way basically similar 
to that of the conditional probability defined in Sec. 2-8. However, the
required mathematical care is beyond the level of an introductory presentation.
An accurate account of the conditional probability distribution



CONTINUOUS PROBABILITY DISTRIBUTION AND DENSITY 205 


and density is given in Loeve (p. 359). The following presentation, often 
given in introductory texts, lacks adequate mathematical rigor. 

The conditional probability of {Y < y} subject to (x - Δx < X < x)
can be written as

    P\{Y < y \mid x - \Delta x < X < x\} = \frac{P\{x - \Delta x < X < x,\ Y < y\}}{P\{x - \Delta x < X < x\}}    (5-50)


The difficulty with this elementary presentation stems from the fact that
the denominator in general might be zero. In the above relationships one
may now introduce the appropriate probability density functions, when
such functions exist.

    P\{Y < y \mid x - \Delta x < X < x\} = \frac{\int_{x-\Delta x}^{x} \int_{-\infty}^{y} f(x,y)\,dy\,dx}{\int_{x-\Delta x}^{x} f_1(x)\,dx}    (5-51)

Taking the limit as Δx approaches zero yields

    \lim_{\Delta x \to 0} P\{Y < y \mid x - \Delta x < X < x\} = \frac{\int_{-\infty}^{y} f(x,y)\,dy}{f_1(x)}    (5-52)


If we denote the left-hand member of this equation by P{Y < y|x} and if
the derivatives of both sides are taken with respect to the variable quantity
y, the following results:

    \frac{dP\{Y < y \mid x\}}{dy} = \frac{f(x,y)}{f_1(x)}    (5-53)

We define f(y|x) = dP{Y < y|x}/dy as the conditional probability
density function of Y, given X. Similarly,

    f(x|y) = \frac{f(x,y)}{f_2(y)}    (5-54)

The conditional density f(y|x) is a function of one variable y and the
parameter x which assumes a given value in each case. If the conditional
density f(y|x) does not depend on the parameter x, X and Y are mutually
independent.

Discrete Case. In the discrete case we do not have the conditional 
densities. The familiar form discussed in Sec. 2-8 is quite satisfactory 
for the calculation of the various conditional probabilities. 

Example 5-7. Consider the two-dimensional density function

    f(x,y) = 2    for 0 < x < 1, 0 < y < x
    f(x,y) = 0    outside

(a) Find the marginal density functions.

(b) Find the conditional density functions.





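Solution

(a)
    f_1(x) = \int_{-\infty}^{\infty} f(x,y)\,dy = \int_0^{x} 2\,dy = 2x        0 < x < 1
    f_1(x) = 0    outside

    f_2(y) = \int_{-\infty}^{\infty} f(x,y)\,dx = \int_y^{1} 2\,dx = 2(1 - y)    0 < y < 1
    f_2(y) = 0    outside

(b)
    f(y|x) = \frac{f(x,y)}{f_1(x)} = \frac{1}{x}          0 < y < x,  0 < x < 1
    f(x|y) = \frac{f(x,y)}{f_2(y)} = \frac{1}{1 - y}      y < x < 1,  0 < y < 1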
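The marginal densities of Example 5-7 can be checked numerically. The following is only a sketch: the function names and the midpoint-rule integrator are illustrative choices, not part of the text, and the printed values are approximations.

```python
def joint_density(x, y):
    """Density of Example 5-7: 2 on the triangle 0 < y < x < 1, zero outside."""
    return 2.0 if 0.0 < y < x < 1.0 else 0.0


def integrate(func, lower, upper, steps=20_000):
    """Midpoint-rule approximation of the integral of func over [lower, upper]."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (i + 0.5) * width) for i in range(steps))


def marginal_x(x):
    """Approximate f_1(x) by integrating the joint density over y."""
    return integrate(lambda y: joint_density(x, y), 0.0, 1.0)


def marginal_y(y):
    """Approximate f_2(y) by integrating the joint density over x."""
    return integrate(lambda x: joint_density(x, y), 0.0, 1.0)


print(marginal_x(0.3))  # should be close to 2x = 0.6
print(marginal_y(0.3))  # should be close to 2(1 - y) = 1.4
```

Dividing joint_density(x, y) by either marginal gives the corresponding conditional density at that point.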






A pictorial interpretation of conditional probability is given in Fig.
5-7. Let R be a region within the range of definition of the density
function of a two-dimensional random variable (also called bivariate).
We ask ourselves what is the probability of the variable Y being in the
interval

    y_0 < Y < y_0 + \Delta y_0

given that x_0 < X < x_0 + Δx_0. Assuming that all densities exist, the
required probability is

    \frac{f(x_0,y_0)\,\Delta x_0\,\Delta y_0}{f_1(x_0)\,\Delta x_0}    (5-55)

Fig. 5-7. Illustration of different distributions associated with a
two-dimensional random variable.

For arbitrary points of the region R, this ratio divided by Δy_0 is the
conditional density for y, given x, and is designated as

    f(y|x) = \frac{f(x,y)}{f_1(x)}    (5-56)

A further justification may be desirable for showing that the function
in Eq. (5-56) satisfies the requirements for a probability density function.
In fact, since the numerator and the denominator are essentially nonnegative,
the above ratio is a finite nonnegative number [assuming
f_1(x) ≠ 0]. Furthermore,

    \int_{-\infty}^{\infty} f(y|x)\,dy = \frac{1}{f_1(x)} \int_{-\infty}^{\infty} f(x,y)\,dy = \frac{f_1(x)}{f_1(x)} = 1    (5-57)


5-10. Bivariate Normal Distribution. In this section, we shall consider
in some detail the most frequently used two-dimensional distribution,
namely, the bivariate normal distribution

    f(x,y) = C \exp\left[-\left(\frac{x^2}{a^2} - \frac{2xy}{k} + \frac{y^2}{b^2}\right)\right]    (5-58)

a, b, and k are given positive constants. The analysis of the bivariate
normal distribution will be divided into the following three parts:

1. Determining the value of the constant C

2. Determining the marginal densities

3. Determining the conditional densities


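1. In order that f(x,y) be a density function one must have

    \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} C \exp\left\{-\left[\frac{x^2}{a^2} - \frac{2xy}{k} + \frac{y^2}{b^2}\right]\right\} dx\,dy = C \cdot I = 1    (5-59)

Note that

    \exp\left[-\left(\frac{x^2}{a^2} - \frac{2xy}{k} + \frac{y^2}{b^2}\right)\right] = \exp\left[-\left(\frac{1}{a^2} - \frac{b^2}{k^2}\right) x^2\right] \exp\left[-\left(\frac{y}{b} - \frac{bx}{k}\right)^2\right]

A change of variable is in order. Let

    z = \frac{y}{b} - \frac{bx}{k}

Then,

    I = \int_{-\infty}^{\infty} \exp\left[-\left(\frac{1}{a^2} - \frac{b^2}{k^2}\right) x^2\right] dx \int_{-\infty}^{\infty} b\,e^{-z^2}\,dz

    I = b\sqrt{\pi} \int_{-\infty}^{\infty} \exp\left[-\left(\frac{1}{a^2} - \frac{b^2}{k^2}\right) x^2\right] dx = \frac{\pi a b k}{\sqrt{k^2 - a^2 b^2}}    (5-63)

Finally,

    C = \frac{\sqrt{k^2 - a^2 b^2}}{\pi a b k}    (5-64)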


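2. For marginal densities,

    f_1(x) = \int_{-\infty}^{\infty} f(x,y)\,dy = C \int_{-\infty}^{\infty} \exp\left[-\left(\frac{x^2}{a^2} - \frac{2xy}{k} + \frac{y^2}{b^2}\right)\right] dy    (5-65)

A change of variable yields the following results:

    f_1(x) = \frac{1}{\sigma_x \sqrt{2\pi}} \exp\left(-\frac{x^2}{2\sigma_x^2}\right)    (5-66)

    f_2(y) = \frac{1}{\sigma_y \sqrt{2\pi}} \exp\left(-\frac{y^2}{2\sigma_y^2}\right)    (5-67)

with

    \sigma_x^2 = \frac{1}{2}\,\frac{a^2 k^2}{k^2 - a^2 b^2}        \sigma_y^2 = \frac{1}{2}\,\frac{b^2 k^2}{k^2 - a^2 b^2}    (5-68)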

The marginal density distributions are also normal distributions.

3. The conditional probability densities, as obtained from the joint
and marginal densities, are

    f(y|x) = \frac{f(x,y)}{f_1(x)} = \frac{1}{b\sqrt{\pi}} \exp\left[-\left(\frac{y}{b} - \frac{bx}{k}\right)^2\right]

    f(x|y) = \frac{f(x,y)}{f_2(y)} = \frac{1}{a\sqrt{\pi}} \exp\left[-\left(\frac{x}{a} - \frac{ay}{k}\right)^2\right]

For any given value of x or y, the associated conditional densities as well
as the marginal densities are normally distributed.
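As a rough numerical check of the constant C in Eq. (5-64), one may integrate the density of Eq. (5-58) over a large square. This is a sketch under assumed sample values a = 1, b = 1, k = 2 (which satisfy k > ab, needed for the integral to converge); the function names and grid settings are illustrative.

```python
import math


def normalizing_constant(a, b, k):
    """C from Eq. (5-64); requires k > a*b so that the square root is real."""
    return math.sqrt(k * k - a * a * b * b) / (math.pi * a * b * k)


def bivariate_density(x, y, a, b, k):
    """f(x,y) from Eq. (5-58)."""
    exponent = x * x / a**2 - 2.0 * x * y / k + y * y / b**2
    return normalizing_constant(a, b, k) * math.exp(-exponent)


def total_probability(a, b, k, half_width=8.0, steps=400):
    """Midpoint-rule double integral of the density over a centered square."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * h
        for j in range(steps):
            y = -half_width + (j + 0.5) * h
            total += bivariate_density(x, y, a, b, k)
    return total * h * h


print(total_probability(1.0, 1.0, 2.0))  # should be very close to 1
```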

5-11. Functions of Random Variables. One of the most fundamental
problems in mathematics and physics is the problem of transforming a set
of given data from one coordinate frame to another. For instance, we
may have some information concerning a variate X = (X_1, X_2, . . . , X_n),
and, knowing a function of this variate, say Y = g(X), we wish to obtain
comparable information on function Y. The simplest examples of such
functions are given by ordinary mathematical functions of one or more
variables. In the field of probability, knowing the probability density
of a random variable X, we desire to find the density distribution of a new
random variable Y = g(X). A moment of reflection is sufficient to
realize the significance of such queries in physical problems. In almost
any physical problem, we express the result of a complex observation or
experiment in terms of certain of its basic constituents. We express the
current in a system in terms of certain parameters, say resistances,
voltages, etc. Thus, the problem generally requires computing the value
of an assumed function, knowing the value of its arguments. The
computation of interest may be of a deterministic or a probabilistic origin.
Our present interest in the problem is, of course, in the latter direction.

First we shall consider the case of a real single-valued continuous
strictly increasing function. Then the procedure will be extended to
cover the more general cases.





One-dimensional Case. Let X be a random variable with CDF F(x)
and let g(x) be a real single-valued continuous strictly increasing function.
The CDF of the new random variable Y = g(X) can be easily calculated
as

    G(y) = P\{g(X) < y\} = 1        if y > g(+\infty)
    G(y) = F(g^{-1}(y))             if g(-\infty) \le y \le g(+\infty)    (5-71)
    G(y) = 0                        if y < g(-\infty)

g^{-1}(y) being the inverse of g(x).

Thus, G(y), the CDF of the new variable, is completely determined in
terms of F(x) and the transformation g(x). If X has a density f(x), the
density of Y = g(X) can easily be found as

    G(y) = \int_{-\infty}^{g^{-1}(y)} f(x)\,dx    (5-72)

When this latter integral exists, the density function of Y is

    p(y) = 0                                        for y < g(-\infty)
    p(y) = f(g^{-1}(y))\,\frac{dg^{-1}(y)}{dy}      for g(-\infty) < y < g(+\infty)    (5-73)
    p(y) = 0                                        for y > g(+\infty)

Because of the diversity of the cases encountered in practice, it is
advisable not to use any ready-made formula for obtaining the density
function of the new variable. For this reason the examples given below
have been worked out directly from the definitions. However, in most
cases the proper application of Eqs. (5-72) and (5-73) is adequate.

As a particular application of this problem, consider the case of a linear
half-wave rectifier followed by a hypothetical amplifier. The output
signal is simply

    Y = AX    X \ge 0
    Y = 0     X < 0

For the positive values of X we have

    p(y) = \frac{1}{A} f\left(\frac{y}{A}\right)

For example, if X has a normal distribution with zero mean and a standard
deviation σ, then

    p(y) = \frac{1}{A\sigma\sqrt{2\pi}} \exp\left(-\frac{y^2}{2A^2\sigma^2}\right)    y > 0

It should be noted that

    P\{-\infty < X < 0\} = P\{Y = 0\} = 1/2

The density of Y consists of the above continuous distribution and a
discrete probability of 1/2 applied at y = 0.
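The mixed character of the rectifier output (a continuous part plus a probability mass at zero) can be illustrated by simulation. The sketch below assumes a gain A = 3 and σ = 1, chosen only for illustration, and a fixed seed so that the run is repeatable.

```python
import random


def rectified_samples(gain, sigma, count, seed=0):
    """Draw normal inputs and pass them through Y = gain*X (X >= 0), Y = 0 (X < 0)."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(count):
        x = rng.gauss(0.0, sigma)
        outputs.append(gain * x if x >= 0.0 else 0.0)
    return outputs


samples = rectified_samples(gain=3.0, sigma=1.0, count=20_000)

# Fraction of outputs that sit exactly at zero; should be near 1/2.
fraction_at_zero = sum(1 for y in samples if y == 0.0) / len(samples)
print(fraction_at_zero)
```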

Example 5-8. Find (a) the distribution and (b) the density functions for

    Y = aX + b    a \ne 0, b real

assuming that F(x) and f(x), the distribution and the density of X, are known.

Solution

(a) Distribution function:

    G(y) = P\{aX + b < y\}                                        a \ne 0
    G(y) = P\{X < (y - b)/a\} = F\left(\frac{y - b}{a}\right)     if a > 0
    G(y) = P\{X > (y - b)/a\}                                     if a < 0

(b) Density function:

    G(y) = \int_{-\infty}^{(y-b)/a} f(t)\,dt    if a > 0

    G(y) = \int_{(y-b)/a}^{\infty} f(t)\,dt     if a < 0

The density function in both cases is given by

    p(y) = \frac{1}{|a|} f\left(\frac{y - b}{a}\right)



Example 5-9. Find (a) the distribution and (b) the density functions for

    Y = e^X

Solution

(a) Distribution function:

    G(y) = P\{e^X < y\} = 0    y \le 0
    G(y) = F(\ln y)            y > 0

(b) Density function. If X has a density f(x), then

    G(y) = \int_{-\infty}^{\ln y} f(t)\,dt    y > 0

    p(y) = 0                        y < 0
    p(y) = \frac{f(\ln y)}{y}       y > 0





Example 5-10. Find the transformation of the variable which changes any given
density function into a rectangular distribution, i.e.,

    p(y) = 0    y < 0
    p(y) = 1    0 < y < 1
    p(y) = 0    y > 1

Solution. We confine ourselves to the case when F(x), the CDF of X, is strictly
increasing and continuous. Then consider Y = F(X) as the new variable. The
density distribution of Y is

    p(y) = 0                                for y < F(-\infty) = 0
    p(y) = \frac{f(x)}{(d/dx)F(x)} = 1      for F(-\infty) < y < F(+\infty)    (0 < y < 1)
    p(y) = 0                                for y > F(+\infty) = 1
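The transformation Y = F(X) of Example 5-10 can be tried on a concrete case. The sketch below assumes, purely for illustration, that X has the exponential CDF F(x) = 1 - e^{-x}; the sample size, seed, and function names are arbitrary choices.

```python
import math
import random


def exponential_cdf(x):
    """Assumed CDF of X for this illustration: F(x) = 1 - exp(-x), x > 0."""
    return 1.0 - math.exp(-x) if x > 0.0 else 0.0


def transformed_samples(count, seed=0):
    """Draw X from the exponential distribution and return Y = F(X)."""
    rng = random.Random(seed)
    return [exponential_cdf(rng.expovariate(1.0)) for _ in range(count)]


def empirical_cdf(samples, y):
    """Fraction of samples not exceeding y."""
    return sum(1 for s in samples if s <= y) / len(samples)


ys = transformed_samples(20_000)
for level in (0.25, 0.5, 0.75):
    # For a rectangular distribution on (0, 1) the CDF at level equals level.
    print(level, empirical_cdf(ys, level))
```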


Example of a Simple Nonlinear Device. Consider the nonlinear device
of Fig. 5-8 with an input-output relationship

    Y = aX^2

Knowing the probability density function of the input signal X, we wish
to derive the corresponding function for the output signal. The curve
Y = aX^2 can be divided into two monotonic parts.

Fig. 5-8. A nonlinear transformation of a random variable.

    P\{y_0 < Y < y_0 + \Delta y_0\} = P\{x_0 < X < x_0 + \Delta x_0\} + P\{-x_0 - \Delta x_0 < X < -x_0\}

In the limit when Δx_0 and Δy_0 tend to zero,

    p(y_0) = \frac{f(x_0)}{|dy/dx|_{x=x_0}} + \frac{f(-x_0)}{|dy/dx|_{x=-x_0}} = \frac{f(x_0) + f(-x_0)}{|2ax_0|}

Note that the probability density of Y is not generally equal to twice the
density of X, unless f(x) = f(-x). As an example, suppose that X is a
random signal with normal distribution

    f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{x^2}{2\sigma^2}\right)    -\infty < x < \infty






If F(x) has a density, then P{X = g^{-1}(y_0)} = 0 for any arbitrary y_0, and
by differentiating G(y) we find the density distribution p(y):

    p(y) = -\frac{dF(g^{-1}(y))}{dy} = -\frac{dF(g^{-1}(y))}{dg^{-1}(y)}\,\frac{dg^{-1}(y)}{dy}

    p(y) = -f(g^{-1}(y))\,\frac{1}{g'(g^{-1}(y))}        for g'(x) < 0

The above two cases can be combined in

    p(y) = f(g^{-1}(y))\,\frac{1}{|g'(g^{-1}(y))|}       for g'(x) < 0 or g'(x) > 0    (5-79)


5-12. Transformation from Cartesian to Polar Coordinate System.

A frequent application of the material of the previous section occurs in the
transformation of polar and rectangular coordinate systems. Suppose
that we are studying the position of a random point M of the two-dimensional
plane with reference to the rectangular coordinates X and Y.
The position of the point M(X,Y) can be considered as a two-dimensional
random variable. We assume that the joint density distribution f(x,y)
is known. The problem is to change from the cartesian to the polar
coordinate system M(R,Φ) and to determine the corresponding density
function p(r,φ). The following equations are self-explanatory:

    R = \sqrt{X^2 + Y^2}        \Phi = \tan^{-1}\frac{Y}{X}
    X = R\cos\Phi               Y = R\sin\Phi                    (5-80)

    |J| = \begin{vmatrix} \partial R/\partial X & \partial R/\partial Y \\ \partial\Phi/\partial X & \partial\Phi/\partial Y \end{vmatrix} = \frac{1}{\sqrt{X^2 + Y^2}}    (5-81)

Thus, according to Eq. (5-70),

    p(r,\varphi) = (x^2 + y^2)^{1/2} f(x,y)    (5-82)

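For instance, if X and Y are normally distributed independent random
variables with densities

    f_1(x) = \frac{1}{\sigma_x\sqrt{2\pi}} \exp\left(-\frac{x^2}{2\sigma_x^2}\right)        f_2(y) = \frac{1}{\sigma_y\sqrt{2\pi}} \exp\left(-\frac{y^2}{2\sigma_y^2}\right)    (5-83)

then their joint density distribution will be

    f(x,y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left[-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right)\right]    (5-84)

The density distribution in the polar system is

    p(r,\varphi) = \frac{r}{2\pi\sigma_x\sigma_y} \exp\left[-\frac{r^2}{2}\left(\frac{\cos^2\varphi}{\sigma_x^2} + \frac{\sin^2\varphi}{\sigma_y^2}\right)\right]    (5-85)

In engineering literature, this distribution is sometimes called a Rayleigh
distribution. In the particular case when σ_x = σ_y = σ, the Rayleigh distribution
becomes

    p(r,\varphi) = \frac{r}{2\pi\sigma^2} \exp\left(-\frac{r^2}{2\sigma^2}\right)    (5-86)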


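This probability distribution function is independent of the direction of
the point in the plane with respect to the reference coordinate system.
In other words, Φ has a uniform distribution between zero and 2π. In
this case the probability of the point M not being closer than a distance d
to the origin is given by

    \int_0^{2\pi} \int_d^{\infty} \frac{r}{2\pi\sigma^2} \exp\left(-\frac{r^2}{2\sigma^2}\right) dr\,d\varphi = \exp\left(-\frac{d^2}{2\sigma^2}\right)    (5-87)

The probability of the point M being in the region

    r_1 \le R \le r_2        \varphi_1 \le \Phi \le \varphi_2

is given by

    \int_{\varphi_1}^{\varphi_2} \int_{r_1}^{r_2} \frac{r}{2\pi\sigma_x\sigma_y} \exp\left[-\frac{r^2}{2}\left(\frac{\cos^2\varphi}{\sigma_x^2} + \frac{\sin^2\varphi}{\sigma_y^2}\right)\right] dr\,d\varphi    (5-88)

Note that the joint density distribution of Eq. (5-86) is the product of the
individual marginal probabilities 1/2π and

    p(r) = \frac{r}{\sigma^2} \exp\left(-\frac{r^2}{2\sigma^2}\right)    (5-89)

The latter distribution is illustrated in Fig. 5-10.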
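The closed form exp(-d^2/2σ^2) for the probability of M lying farther than d from the origin can be compared with a direct numerical integration of the marginal density of Eq. (5-89). This is only a sketch; the sample values d = 1.5 and σ = 1, the truncation point, and the function names are assumptions made for the illustration.

```python
import math


def rayleigh_density(r, sigma):
    """Marginal density of R from Eq. (5-89)."""
    return (r / sigma**2) * math.exp(-r * r / (2.0 * sigma**2))


def tail_probability_numeric(d, sigma, upper_factor=12.0, steps=100_000):
    """Midpoint-rule integral of the density from d to d + upper_factor*sigma."""
    upper = d + upper_factor * sigma
    h = (upper - d) / steps
    return h * sum(rayleigh_density(d + (i + 0.5) * h, sigma) for i in range(steps))


def tail_probability_closed_form(d, sigma):
    """P{R >= d} = exp(-d^2 / (2 sigma^2))."""
    return math.exp(-d * d / (2.0 * sigma**2))


print(tail_probability_numeric(1.5, 1.0))
print(tail_probability_closed_form(1.5, 1.0))
```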





PROBLEMS 

5-1. For what value of K is the function

F{x) = 1 — /\ j- " j > j-o a > 0 
F(x) = 0    x < x_0

a CDF? Find the corresponding density function.

5-2. Let X be a random variable varying in the range 0 < x < 1. For what value
of K is the function

    f(x) = Kx(1 - x)    0 < x < 1
    f(x) = 0            otherwise

a probability density function? For this value of K determine the CDF.

5-3. The joint density function for two random variables X and Y is given below:

J{x,y) — for > 0 y > 0 

f(x,y) = 0    elsewhere

Find ' r<l! 

5-4. A random variable has a probability density distribution as shown in Fig.
P5-4.

(a) Find the value of the constant k.

(b) Find the CDF.

(c) Determine P{1 < x < 2}.


Fig. P5-4

5-5. Evaluate the parameter K which will make the function

fix) = T >i) y > 0 

f(x,y) = 0    elsewhere

a permissible probability density function. Find

    P{1 < x < 2, 1 < y < 2}

    P{1 < x < 2}, P{1 < y < 2}

    F(x,y), f_1(x), f_2(y)
    f(x|y), f(y|x)

5-6. The current I in a certain electric circuit is assumed to be a random variable
normally distributed about its average value of a = 10 amperes with σ = 1. Determine
the following:

(a) The probability of the current I being less than 11.5 amperes.

(b) The probability of the current I being larger than 20 amperes.

(c) The probability of I being between 10 and 20 amperes.

(d) The probability of I being between 0 and 11 amperes.

5-7. In a lot consisting of 200 items 10 are defective. Find the probability that a
random sample of a = 10 of this lot yields exactly one defective. 

5-8. If the probability that any person 25 years old will die within one year is
p = 0.01, find the probability that out of a group of 100 such persons (a) none, (b)
only one person, (c) not more than one person, (d) more than one person, and (e) at
least one person will die in a year.

5-9. A random voltage X has a normal distribution



This voltage is applied to an ideal full-wave rectifier, that is, the output V is given by
the equation

    V = |X|


Determine the density distribution of the output. 

5-10. A certain fluctuating electric current can be considered as a random variable
whose value is I amperes. I is uniformly distributed between and 11 amperes.
Assuming that this current flows in a 2-ohm resistor, what is the density distribution
of the power, that is, 2I^2?

5-11. The amplitude of a random noise variable has a normal density distribution


    p(x) =



Find the density distribution for its power spectrum, that is, Y = X^2.

5-12. A random voltage X has a uniform probability density in the interval
-K < X < K (K > 0). X is the input to a nonlinear device with the characteristic
shown in Fig. P5-12. Find the density distribution of the output Y in all three following
cases:

(a) K < a.

(b) a < K < x_0.

(c) x_0 < K.


Fig. P5-12


5-13. The CDF of a random variable X is given by Fig. P5-13.

(a) Give the density function f(x) graphically with all pertinent values.

(b) Determine E(X).






5-14. (a) Determine the value of the constant m which makes the function
f(x) = muc'* X > 0 


a permissible density function. 

(b) Determine P{X < 1}.

(c) Determine P{1 < X < 2}.

5-15. Let I be a random current having a normal distribution with mean of 10
amperes and σ = 1. This current is applied across a 1/2-ohm resistor. Find the
probability distribution of the power dissipated in the resistor.

5-16. Check if the two variables X and Y with joint density

4 ^ ^ y ^ ^ 

    f(x,y) = 0    elsewhere


are independent.

5-17. Same question for

    f(x,y) = 8xy    0 < x < y, 0 < y < 1

5-18. A number is chosen at random on a semicircle and projected onto the diameter.
Find the density function for the point of projection.

5-19. Two independent random variables have the following distributions:

    f_1(x) = 1    0 < x < 1        f_2(y) = a e^{-ay}    y > 0
    f_1(x) = 0    elsewhere        f_2(y) = 0            elsewhere

(a) Find the density distribution of Z = X + Y.

(b) Find P{Z > 1}.

5-20. The random variable X has an exponential density distribution
fix) ■* 1 — e“* 0 < X < oc 

Find the density of the variables 

(a) Y = 3X + 2

(b) Z = 2X - 3

(c) (/ » X — 





5 - 21 . A random variable (X,F) has a uniform probability distribution of in the 
region {\X\ < 1, |r| < 1). FindP(X* + F* < 

5-22. If A, B, and C are uniformly distributed between 0 and 1, what is the probability
that the equation Ax^2 + Bx + C = 0 will have real roots?

5-23. The random variable X has a normal distribution



Find the distribution of K = A"’. 

6 - 24 . The density of a two-dimensional random variable (A",F) is given below: 
f(x,y) = j: > 0 !j > 0 

Make the transformation of the random variable to (f^F), 

1/ = A + y 



and derive the new density fiinetion. 

5-25. Let X and Y be random variables with exponential distribution on the positive
real axis. Make the transformation of variable


    U = X + Y

V = - — - 

+ F 


(a) Find the probability density p(u,v).

(b) Are the variables U and V independent?



CHAPTER 6 


STATISTICAL AVERAGES 


In Chaps. 2 and 5 we introduced the elements of discrete and continuous
probability theory. This chapter is devoted to an integrated discussion
of the concepts of averages (or expected values), moments, and related
generating functions. In most probability problems we have a number
of random variables and a set of functions associated with them. For
example, in the simplest case, we may have a weighting function f(X)
(also called utility function, cost function, or loss function) associated
with a random variable X. Then the general nature of questions of
interest is to obtain what may be called the statistical average or the
expected value of the weighting function f(X), that is, the average value of
f(X) in the long run when X assumes all its possible values with their
specified probabilities.

The computation of average values of random variables and the cor- 
responding cost functions is of considerable interest in physical problems. 
The major part of this chapter will be devoted to the study of such 
expected values. Section 6-2 treats the expected value of the sum and 
the product of a number of functions of several independent random 
variables. Sections 6-3 to 6-8 concentrate on a particular form of weighting
function, namely, X^r, which is of great practical significance as it
leads to different moments. Later sections of the chapter will be devoted
to relating these topics to the familiar theory of Fourier and Laplace 
transforms. Finally we shall apply this material in deriving a simple 
form of the central-limit theorem in a subsequent chapter. 

The concept of statistical averaging applies to discrete as well as to 
continuous random variables. At the beginning of this chapter there 
may be a tendency, in proving certain theorems, to employ a discrete 
probability scheme. The results are equally valid for continuous 
schemes. 

6-1. Expected Values; Discrete Case. Let X be a discrete random
variable assuming the values x_1, x_2, . . . , x_n, . . . , with respective
probabilities p_1, p_2, . . . , p_n, . . . , and let g(X) be a real single-valued
function. The mathematical expectation of g(X) is defined as

    E[g(X)] = \sum_{i=1}^{\infty} p_i\,g(x_i)    (6-1)





This definition is contingent upon the convergence of the above series;
however, in most physical problems the convergence restriction is only
of theoretical significance. The mathematical expectation of a function
is, in a way, the "average" of that function over all possible values that
the function assumes. For example, if the outcome of a random experiment
assumes numerical values g(x_1), g(x_2), . . . , with frequencies
n_1, n_2, . . . , and the experiment is repeated N times, then the arithmetic
average of the function g(x) is

    g(X) average = \frac{n_1 g(x_1) + n_2 g(x_2) + n_3 g(x_3) + \cdots}{N}    (6-2)

When the number of trials N is made very large, it can be intuitively
inferred that the average value of g(X) in Eq. (6-2) approaches the expectation
of g(X) as defined in Eq. (6-1). The mathematical expectation is
also called the average, statistical average, and the mean value of the random
variable.

The reason for using the term “statistical average” will be apparent 
when one has proved the theorem which is generally known as the law of 
large numbers. 

Note that, in the simple case of g(X) = X, Eq. (6-1) gives the average
value of the random variable X itself. Thus,

    E(X) = X average = \sum_{i=1}^{\infty} p_i x_i    (6-3)

Also note that when X = const = K, then E(X) = E(K) = K.

The above definition can be extended to the case of multidimensional
discrete random variables. For example, suppose that a function g(X,Y)
is associated with a two-dimensional discrete random variable (X,Y)
having a joint probability distribution P{i,j}. The concept of mean
value as outlined in Eqs. (6-1) and (6-2) leads to the following definition
for the mathematical expectation of the function g(X,Y):

    E[g(X,Y)] = \sum_i \sum_j P\{i,j\}\,g(x_i,y_j)    (6-4)


The above defining equations can be directly extended to cover the
case of a random variable assuming a continuum of values. Let f(x) be
the density function and g(x) a real single-valued integrable function;
then we define

    E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx    (6-5)

For instance,

    E[X] = \int_{-\infty}^{\infty} x f(x)\,dx

Similarly, for a two-dimensional random variable (X,Y) and the
weighting function g(x,y), we define

    E[g(X,Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x,y) f(x,y)\,dx\,dy    (6-6)

All integrals in the defining equations must exist.

Example 6-1. An urn contains three white and two black balls. A and B agree to
the following game. Each person draws two balls at a single drawing, the balls
being replaced after each drawing. B will pay A the amount of $5 for each white ball
and $2 for each black ball.

(a) What is the mathematical expectation of the player A?

(b) How much should A pay B for the drawing of a white ball and a black ball so
that their expectations are the same?

Solution

(a) There are three possible cases: WW, WB or BW, and BB.

    P(WW) = 3/5 · 2/4 = 3/10
    P(WB) = 3/5 · 2/4 = 3/10
    P(BW) = 2/5 · 3/4 = 3/10
    P(BB) = 2/5 · 1/4 = 1/10

Let X be the gain of A in a drawing; then X will assume the following three values at
random:

    x_1 = 10      x_2 = 7       x_3 = 4
    p_1 = 3/10    p_2 = 6/10    p_3 = 1/10

Now one may apply Eq. (6-1).

    E(X) = 3/10 · 10 + 6/10 · 7 + 1/10 · 4 = 7.6

In the long run, A can expect an average gain of $7.60 in each drawing.

(b) Let B's gain be x and y dollars for the drawing of a white and a black ball,
respectively. Then if Y is the gain of B,

    E(Y) = 3/10 (2x) + 6/10 (x + y) + 1/10 (2y)

The answer to the question is given by any values of x and y satisfying

    3x + 2y - 19 = 0
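The expectation in Example 6-1 can be reproduced by enumerating the equally likely ordered draws. This is a minimal sketch; the function name and the use of exact fractions are illustrative choices rather than part of the text.

```python
from fractions import Fraction
from itertools import permutations


def expected_gain(pay_white, pay_black):
    """Expected payment for two balls drawn without replacement from 3 white, 2 black."""
    urn = ["W"] * 3 + ["B"] * 2
    payoff = {"W": pay_white, "B": pay_black}
    # All ordered pairs of distinct balls are equally likely.
    draws = list(permutations(range(len(urn)), 2))
    total = sum(Fraction(payoff[urn[i]] + payoff[urn[j]]) for i, j in draws)
    return total / len(draws)


print(expected_gain(5, 2))  # part (a): 38/5, that is, 7.6
print(expected_gain(3, 5))  # one solution of 3x + 2y = 19 for part (b)
```

Any pair (x, y) on the line 3x + 2y = 19 gives the same expectation, which is the content of part (b).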

6-2. Expectation of Sums and Products of a Finite Number of Independent
Discrete Random Variables. In this section we employ Eq.
(6-4) in obtaining the expectation of sums and products of a number of
discrete random variables.

Sum of Random Variables. Let

    g(X,Y) = X + Y    (6-7)

Then in the discrete case

    E(X + Y) = \sum_i \sum_j P\{i,j\}(x_i + y_j)
             = \sum_i \sum_j P\{i,j\}\,x_i + \sum_i \sum_j P\{i,j\}\,y_j
             = \sum_i x_i \sum_j P\{i,j\} + \sum_j y_j \sum_i P\{i,j\}

Finally    E(X + Y) = E(X) + E(Y)    (6-9)

This relation is also valid when X and Y are random variables of a con- 
tinuous type. 

More generally, for the expectation of the sum of a finite number of
discrete random variables (not necessarily independent), one obtains

    E(X_1 + X_2 + \cdots + X_n) = E(X_1) + E(X_2) + \cdots + E(X_n)    (6-10)

provided that the expectation of the individual variables has a finite
value.

Product of Two Independent Random Variables. Let

    g(X,Y) = XY

For independent discrete variables we have

    P\{i,j\} = p_1(i) \cdot p_2(j)

Hence,

    E(XY) = \sum_i \sum_j [x_i p_1(i)][y_j p_2(j)]
          = \sum_i x_i p_1(i) \sum_j y_j p_2(j)    (6-11)

    E(XY) = E(X) \cdot E(Y)    (6-12)

This result also holds for independent random variables with continuous
distributions:

    E(XY) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy\,f(x,y)\,dx\,dy
          = \int_{-\infty}^{\infty} x f_1(x)\,dx \int_{-\infty}^{\infty} y f_2(y)\,dy = E(X)E(Y)    (6-12a)

When one of the variables has a constant value K, we have

    E(KX) = E(K) \cdot E(X) = K E(X)

By induction one arrives at the result that the expectation of the product
of a finite number of discrete independent random variables is the
product of their expectations.

    E(X_1 X_2 \cdots X_n) = E(X_1) \cdot E(X_2) \cdots E(X_n)    (6-13)





Note that the independence of the variables is required for Eq. (6-13) 
but not for Eq. (6-9). 

Example 6-2. Two dice are thrown; find the expected value for the sum and the
product of their face numbers.

Solution. The joint probability P{i,j} and the marginal probabilities p_1(i) and
p_2(j) are

    P{i,j} = 1/36    p_1(i) = 1/6    p_2(j) = 1/6

    E(X) = E(Y) = 1/6 (1 + 2 + 3 + 4 + 5 + 6) = 7/2

    E(X + Y) = 1/36 (2 + 3 + 4 + 5 + 6 + 7
                   + 3 + 4 + 5 + 6 + 7 + 8
                   + 4 + 5 + 6 + 7 + 8 + 9
                   + 5 + 6 + 7 + 8 + 9 + 10
                   + 6 + 7 + 8 + 9 + 10 + 11
                   + 7 + 8 + 9 + 10 + 11 + 12)

             = 1/36 (27 + 33 + 39 + 45 + 51 + 57) = 1/36 · 252 = 7

This direct calculation checks with the result E(X + Y) = 7/2 + 7/2 = 7.

The product of the two face numbers assumes the following values, with their
corresponding probabilities.

    Product:      1     2     3     4     5     6     8     9     10
    Probability:  1/36  2/36  2/36  3/36  2/36  4/36  2/36  1/36  2/36

    Product:      12    15    16    18    20    24    25    30    36
    Probability:  4/36  2/36  1/36  2/36  2/36  2/36  1/36  2/36  1/36

    E(XY) = 1/36 (1 + 4 + 6 + 12 + 10 + 24 + 16 + 9
                + 20 + 48 + 30 + 16 + 36 + 40
                + 48 + 25 + 60 + 36)

    E(XY) = 1/36 · 441 = 49/4

    E(X) · E(Y) = 7/2 · 7/2 = 49/4
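Both expectations of Example 6-2 can be confirmed by enumerating the 36 equally likely outcomes. The sketch below uses exact fractions; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
outcomes = list(product(faces, repeat=2))   # all 36 ordered pairs (i, j)
probability = Fraction(1, len(outcomes))    # P{i,j} = 1/36 for fair, independent dice

expected_face = sum(Fraction(i, 6) for i in faces)
expected_sum = sum(probability * (i + j) for i, j in outcomes)
expected_product = sum(probability * (i * j) for i, j in outcomes)

print(expected_face)      # 7/2
print(expected_sum)       # 7
print(expected_product)   # 49/4
```

The product result equals expected_face squared, as Eq. (6-12) requires for independent variables.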


6-3. Moments of a Univariate Random Variable. Equations (6-1) and
(6-5) describe the mathematical expectation associated with a general
function of a random variable g(X). The particularly simple function

    g(X) = X^r

plays an important role in the theory of probability and application
problems. The expectation of X^r, that is,

    m_r = E[X^r] = \sum_{i=1}^{\infty} p_i x_i^r
or                                                              (6-14)
    m_r = E[X^r] = \int_{-\infty}^{\infty} x^r f(x)\,dx

is called the rth moment about the origin of the random variable X.
The definition is contingent on the convergence of the series or the
existence of the integral, i.e., on the finiteness of the rth-order moment.
For r = 0 one has the zero-order moment about the origin.

    E(X^0) = 1



For r = 1 the first-order moment about the origin or the mean value of the
random variable is

    E(X^1) = E(X) = m_1 = \sum_{i=1}^{\infty} p_i x_i
or                                                              (6-15)
    E(X) = \int_{-\infty}^{\infty} x f(x)\,dx

For r = 2 the second-order moment about the origin is

    E(X^2) = p_1 x_1^2 + p_2 x_2^2 + \cdots + p_n x_n^2 + \cdots    (6-16)

or  E(X^2) = \int_{-\infty}^{\infty} x^2 f(x)\,dx

The rth-order moment about a point c is defined by

    E[(X - c)^r]    (6-17)

A very useful and familiar case is the moment of the variable centered
about the point m_1, the mean value of the variable. Such moments are
called central moments.

    \mu_r = central moment of order r = E[X - E(X)]^r = E(X - m_1)^r    (6-18)

The values of central moments of first and second order in terms of ordinary
moments are discussed below.

First-order Central Moment

    \mu_1 = first-order central moment = expectation of deviation
            of random variable from its mean value m_1
          = E(X - m_1)

    \mu_1 = E(X - m_1) = E(X) - E(m_1) = m_1 - m_1 = 0    (6-19)

Second-order Central Moment

    \mu_2 = second-order central moment = E[(X - m_1)^2]

    \mu_2 = E(X^2) - 2E(X m_1) + E(m_1^2) = E(X^2) - m_1^2    (6-20)

The second-order central moment is also called the variance of the random
variable X.

    \mu_2 = var(X) = E[(X - m_1)^2] = E(X^2) - m_1^2 = m_2 - m_1^2    (6-21)

The nonnegative square root of the variance is called the standard
deviation of the random variable X.

    Standard deviation = \sigma_x = \sqrt{\mu_2} = \sqrt{m_2 - m_1^2}    (6-22)

The physical interpretation of the first and second moments in engineering
problems is self-evident. The first moment m_1 is the ordinary
mean or the average value of the quantity under consideration, and m_2 is
the average of the square of that quantity (mean square). For instance,
if X is an electric current, m_1 and m_2 are the average (or d-c level) of
the current and the power dissipated in the unit resistance, respectively.
Similarly, the standard deviation is the root mean square (rms) of the
current, about its mean value.

Example 6-3. In part (a) of Example 6-1, find m_1, m_2, \mu_2, and \sigma.

Solution

    E(X) = m_1 = 7.6

    E(X^2) = m_2 = 3/10 · 100 + 6/10 · 49 + 1/10 · 16 = 61.0
    \mu_2 = m_2 - m_1^2 = 61.00 - 57.76 = 3.24
    \sigma = \sqrt{\mu_2} = 1.8
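The moments of Example 6-3 follow directly from Eqs. (6-14), (6-21), and (6-22). A short sketch, with an illustrative helper name and exact fractions for the probabilities:

```python
import math
from fractions import Fraction


def raw_moment(values, probabilities, order):
    """rth moment about the origin for a discrete variable, Eq. (6-14)."""
    return sum(p * x**order for x, p in zip(values, probabilities))


values = [10, 7, 4]
probabilities = [Fraction(3, 10), Fraction(6, 10), Fraction(1, 10)]

m1 = raw_moment(values, probabilities, 1)
m2 = raw_moment(values, probabilities, 2)
variance = m2 - m1**2                      # Eq. (6-21)
standard_deviation = math.sqrt(variance)   # Eq. (6-22)

print(m1, m2, variance, standard_deviation)
```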

The above definition can be extended to the sum of two or more random
variables. The pertinent algebraic operations will be simplified
by the use of some familiar mathematical formalism. Let E(X) and
\sigma(X) stand for the expectation and the standard deviation of a random
variable X, respectively; then for the sum of two random variables we
write

    var(X + Y) = E[(X + Y)^2] - [E(X + Y)]^2
               = E(X^2) + E(Y^2) + 2E(XY) - [E^2(X) + E^2(Y) + 2E(X) \cdot E(Y)]    (6-23)

If the two variables are independent, then E(XY) = E(X) · E(Y), and
Eq. (6-23) yields

    \sigma_{x+y}^2 = \sigma_x^2 + \sigma_y^2    (6-23a)

This result can be extended to obtain the standard deviation of the sum
of a finite number of independent random variables.

    \sigma_{x_1 + x_2 + \cdots + x_n}^2 = \sigma_{x_1}^2 + \sigma_{x_2}^2 + \cdots + \sigma_{x_n}^2    (6-24)

Example 6-4. X is a random electric current normally distributed, with mean \bar{X} = 0
and \sigma = 1. If this current is passed through a full-wave rectifier, what is the expected
value of the output?

Solution. Let the output of the rectifier be Y; then

    Y = |X|

Applying Eqs. (5-73), the probability density of Y is

    p(y) = 2 \cdot \frac{1}{\sqrt{2\pi}} e^{-y^2/2} = \sqrt{\frac{2}{\pi}}\, e^{-y^2/2}    y \ge 0

The density distribution has the shape of a normal curve for y ≥ 0 and is zero elsewhere.

    E(Y) = \int_0^{\infty} y \sqrt{\frac{2}{\pi}}\, e^{-y^2/2}\,dy = \sqrt{\frac{2}{\pi}} amperes





Example 6-5. The internal noise of an amplifier has an rms (root mean square)
value of 2 volts. When the signal is added, the rms output is 5 volts. What would
be the rms value of the output when the signal is tripled?

Solution. This simple example is rather familiar to electrical engineers. While the
solution may look obvious, we shall give the hypotheses under which the familiar
solution is obtained. Let S and N be independent random variables (signal and noise)
with zero means and given standard deviations \sigma_s and \sigma_n.

    Root mean square of S = \sqrt{E(S^2)} = \sqrt{\sigma_s^2} = \sigma_s

    Root mean square of N = \sqrt{E(N^2)} = \sqrt{\sigma_n^2} = \sigma_n = 2

As the random variables are assumed to be independent, by Eq. (6-23a) we find

    E[(S + N)^2] = E(S^2) + E(N^2) = \sigma_s^2 + \sigma_n^2 = 5^2

    \sigma_s^2 = 5^2 - 2^2 = 21

The problem asks for

    \sqrt{E[(3S + N)^2]}

Now

    E[(3S + N)^2] = E(9S^2) + E(N^2) = 9 × 21 + 2^2 = 193

    \sqrt{193} = 13.9 volts

Note that no additional assumption is necessary as to the nature of the distribution
functions of the signal and the noise.
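The arithmetic of Example 6-5 generalizes to any scale factor on the signal, under the same hypotheses of independent, zero-mean signal and noise. A small sketch with a hypothetical helper name:

```python
import math


def rms_with_scaled_signal(noise_rms, total_rms, scale):
    """RMS of scale*S + N, given the rms of N alone and of S + N.

    Assumes S and N are independent with zero means, so that mean squares add.
    """
    signal_power = total_rms**2 - noise_rms**2          # sigma_s^2
    return math.sqrt(scale**2 * signal_power + noise_rms**2)


print(rms_with_scaled_signal(noise_rms=2.0, total_rms=5.0, scale=3.0))  # sqrt(193), about 13.9
```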

6-4. Two Inequalities 

An Inequality for Second-order Moments. We should like to compare 
the second-order moment of a random variable about a point c with the 
second-order central moment M 2 of the same variable. 

E[{X - cy] = E[(X - mi + - cY] 

E[iX - cY] = E[(X - m,Yl + 2E[(X - mi)(mi - c)] + ^f(mi - cY] 
E[(X - cY] = M2 + (mi - cY 

The second-order moment aliout any point c 9 ^ mi is larger than the 
central second moment; 


E[{X - cY] > E[{X - mi)^l for r mi 

The fact that the second-order central moment of a random variable is 
the smallest of all second-order moments is of basic significance in the 
theory of error. In the analogy with electrical engineering, note that the 
smallest possible root mean square of a fluctuating current or voltage 
f{t) is obtained when that quantity is measured with respect to its mean 
value (d-c level) rather than any other level. 
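A small numerical check of the relation between the moment about c and the central moment may help. The data set and the helper name `moment_about` below are hypothetical; the values are treated as equally likely outcomes of a discrete random variable.

```python
def moment_about(values, c):
    """Second-order moment about the point c for equally likely values."""
    return sum((x - c) ** 2 for x in values) / len(values)

values = [1.0, 2.0, 3.0, 6.0]
m1 = sum(values) / len(values)        # first moment (mean)
mu2 = moment_about(values, m1)        # central second moment
about_c = moment_about(values, 5.0)   # second moment about c = 5

# The two printed numbers should agree: moment about c = mu2 + (m1 - c)^2.
print(about_c, mu2 + (m1 - 5.0) ** 2)
```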

Chebyshev Inequality. The Chebyshev inequality suggests an interesting relation that exists between the variance and the spreading out of the probability density. Let X be a random variable with a probability density distribution $f(x)$, first moment $m_1$, and standard deviation $\sigma$. Chebyshev's inequality states that the probability of $|X - m_1|$ assuming values larger than $k\sigma$ is less than $1/k^2$.

$$P\{|X - m_1| > k\sigma\} < \frac{1}{k^2} \qquad k = 1, 2, 3, \ldots \tag{6-25}$$





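To prove the validity of this inequality, let $Y = |X - m_1|$, and refer to Fig. 6-1. The desired probability is

$$P\{Y > k\sigma\} = \int_{-\infty}^{m_1 - k\sigma} f(x)\,dx + \int_{m_1 + k\sigma}^{\infty} f(x)\,dx$$

Multiplication by $k^2\sigma^2$ yields

$$k^2\sigma^2 P\{|X - m_1| > k\sigma\} = \int_{-\infty}^{m_1 - k\sigma} k^2\sigma^2 f(x)\,dx + \int_{m_1 + k\sigma}^{\infty} k^2\sigma^2 f(x)\,dx \tag{6-26}$$

Note that in each of the ranges of integration

$$k\sigma < |x - m_1|$$

Thus

$$k^2\sigma^2 P\{|X - m_1| > k\sigma\} \le \int_{-\infty}^{m_1 - k\sigma} (x - m_1)^2 f(x)\,dx + \int_{m_1 + k\sigma}^{\infty} (x - m_1)^2 f(x)\,dx \tag{6-27}$$

But the right side is certainly not greater than the second-order central moment of X; therefore,

$$k^2\sigma^2 P\{|X - m_1| > k\sigma\} < \int_{-\infty}^{\infty} (x - m_1)^2 f(x)\,dx = \sigma^2$$

Dividing both sides of this inequality by $k^2\sigma^2$ gives the desired inequality.

Fig. 6-1. Illustration of Chebyshev's inequality.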


The Chebyshev inequality expresses interesting bounds on the probability of the centralized random variable exceeding any number of standard deviations. For example,

$$P\{|X - m_1| > 2\sigma\} < 0.250 \qquad P\{|X - m_1| > 3\sigma\} < 0.111 \tag{6-28}$$

This result may be applied to a specific known density distribution such as a normal distribution. For normal distributions, from tables we derive

$$P\{|X - m_1| > 2\sigma\} \approx 0.045 \qquad P\{|X - m_1| > 3\sigma\} \approx 0.0027 \tag{6-29}$$





A comparison of Eqs. (6-28) and (6-29) shows that Chebyshev's results are weak and that they give only a rough estimate of the spreading of the distribution. Similar results are valid for a discrete random variable.
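The gap between the Chebyshev bound and the exact normal tail can be tabulated directly. The sketch below is an illustration only; it uses the standard identity $P\{|X - m_1| > k\sigma\} = \operatorname{erfc}(k/\sqrt{2})$ for a normal variable, and the function names are hypothetical.

```python
import math

def chebyshev_bound(k):
    """Upper bound 1/k^2 on P{|X - m1| > k*sigma}, valid for any distribution."""
    return 1.0 / k ** 2

def normal_two_sided_tail(k):
    """Exact P{|X - m1| > k*sigma} for a normal random variable."""
    return math.erfc(k / math.sqrt(2.0))

for k in (1, 2, 3, 4):
    print(k, round(chebyshev_bound(k), 4), round(normal_two_sided_tail(k), 4))
```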

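Example 6-6. Let X be the fraction of the number of heads obtained in throwing an honest coin $10^8$ times. Show that

$$P\{|X - \tfrac{1}{2}| > 0.001\} < 0.01$$

Solution. Let $X_i$ be a random variable denoting the number of heads in the ith throwing,

$$X = \frac{1}{n}(X_1 + X_2 + \cdots + X_n) \qquad n = 10^8$$

$$\bar{X}_i = \tfrac{1}{2} \qquad \sigma_{x_i} = \tfrac{1}{2} \qquad \bar{X} = \tfrac{1}{2} \qquad \sigma_x = \frac{1}{2\sqrt{n}} = 0.5 \times 10^{-4}$$

Applying the Chebyshev inequality to X with $k\sigma_x = 0.001$, that is, $k = 20$, we find

$$P\{|X - m_1| > k\sigma_x\} < \frac{1}{k^2}$$

$$P\{|X - \tfrac{1}{2}| > 0.001\} < \frac{1}{400} < 0.01$$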

6-5. Moments of Bivariate Random Variables. Equations (6-3) to (6-6) give expressions for the expected value of a function of a one-dimensional and a two-dimensional random variable. In Sec. 6-3 we discussed different moments of a one-dimensional random variable by letting $g(X) = X^r$, $r = 1, 2, 3, \ldots$. The object of this section is to derive similar formulations for moments of a two-dimensional random variable by letting

$$g(X,Y) = X^i Y^j \qquad i, j = 1, 2, 3, \ldots \tag{6-30}$$

The moment of order i, j of a two-dimensional discrete random variable is defined by

$$\alpha_{ij} = E(X^i Y^j) \tag{6-31}$$

For a continuous random variable we have

$$\alpha_{ij} = E(X^i Y^j) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^i y^j f(x,y)\,dx\,dy \tag{6-31a}$$

The central moments correspond to the variable being centered at the point representing its two-dimensional first moment, i.e., the point $(\bar{X}, \bar{Y})$:

$$\mu_{ij} = E[(X - \bar{X})^i (Y - \bar{Y})^j] \tag{6-32}$$

The central moments of second order are of considerable interest. They can easily be computed in terms of the central moments of X and Y. Following the above notation,





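$$\alpha_{10} = \bar{X} = E(X)$$
$$\alpha_{01} = \bar{Y} = E(Y)$$
$$\alpha_{11} = \overline{XY} = E(XY) \tag{6-33}$$
$$\alpha_{20} = \overline{X^2} = E(X^2)$$
$$\alpha_{02} = \overline{Y^2} = E(Y^2)$$

The three central moments of second order are

$$\mu_{20} = E[(X - \bar{X})^2] = \sigma_x^2 = \alpha_{20} - \alpha_{10}^2$$
$$\mu_{11} = E[(X - \bar{X})(Y - \bar{Y})] = E(XY) - \bar{X}\bar{Y} = \alpha_{11} - \alpha_{10}\,\alpha_{01} \tag{6-34}$$
$$\mu_{02} = E[(Y - \bar{Y})^2] = \sigma_y^2 = \alpha_{02} - \alpha_{01}^2$$

In the next section we shall show that the ratio $\mu_{11}/\sqrt{\mu_{20}\,\mu_{02}}$ gives an indication of the degree of linear dependence between the two variables.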

6-6. Correlation Coefficient. In everyday problems of the physical sciences, we are often confronted with the study of two variable quantities, X and Y, with a functional relation $y = f(x)$ between them. Sometimes no specific knowledge of any such relationship between the two variables is available, but a set of several of their paired values is known. In such situations, as a rule, we do not search for an analytic relation among the variables but wish to find out if the values of one are influenced by the values of the other. For example, one variable may represent the heights of students in a university and the other variable the heights of their fathers. While it is hopeless to establish a functional relationship between these two variables, it is quite reasonable to expect a certain degree of dependence between the two. A measure of the degree of this common influence is given by what is referred to as the correlation coefficient* $\rho$ defined by

$$\rho = \frac{\mu_{11}}{\sqrt{\mu_{20}\,\mu_{02}}} = \frac{\mu_{11}}{\sigma_x \sigma_y} \tag{6-35}$$

It can be shown that the permissible values for the correlation coefficient are confined to the interval $[-1, +1]$. One way of showing this is by considering a new random variable

$$Z = a(X - \bar{X}) + b(Y - \bar{Y}) \tag{6-36}$$

where a and b are real parameters. The second moment of the random variable Z must, of course, remain a nonnegative number for all values of a and b.

$$E(Z^2) = E[a(X - \bar{X}) + b(Y - \bar{Y})]^2 = a^2\mu_{20} + 2ab\,\mu_{11} + b^2\mu_{02} \tag{6-37}$$

* The coefficient of correlation appeared in the work of Karl Pearson in 1891. An interesting historical and technical account of this topic can be found in the article On the Mathematics of Simple Correlation by C. D. Smith (Math. Mag., vol. 32, no. 2, pp. 57-69, November-December, 1958).





The quadratic form in $a/b$ obtained from Eq. (6-37),

$$\mu_{20}\left(\frac{a}{b}\right)^2 + 2\mu_{11}\left(\frac{a}{b}\right) + \mu_{02} \tag{6-38}$$

is nonnegative for every real value of $a/b$. Hence the discriminant should not become positive;

$$\mu_{11}^2 - \mu_{20}\,\mu_{02} \le 0 \tag{6-39}$$

or, in terms of the correlation coefficient,

$$|\rho| = \left|\frac{\mu_{11}}{\sqrt{\mu_{20}\,\mu_{02}}}\right| \le 1 \tag{6-40}$$

In the extreme case, when $\rho = \pm 1$, there is complete dependence between the variables X and Y. When the two variables are independent, $(X - \bar{X})$ and $(Y - \bar{Y})$ are also independent variables. This fact implies

$$\mu_{11} = E(X - \bar{X}) \cdot E(Y - \bar{Y}) = 0 \tag{6-41}$$

Thus, when X and Y are independent, their correlation coefficient is zero; the converse, however, is not true.* The correlation coefficient $\rho$ indicates a measure of the linear interdependence between the two variables.


Fig. 6-2. (a) No correlation; (b) no correlation but strong dependence; (c) linear dependence; (d) scatter diagram.

When points of a rectangular coordinate system are used to show ordered pairs of values corresponding to pairs of associated discrete random variables, we obtain a diagram commonly known as a scatter diagram. A scatter diagram generally gives an indication of the degree of linear relationship between the two variables. For example, if the two variables are uncorrelated, i.e., if $\rho = 0$, we may find the somewhat uniform scatter diagram of Fig. 6-2a or b. In these cases the variables are linearly independent, although they may be bound by some nonlinear relationships (Fig. 6-2b). Conversely, when the variables are linearly related ($\rho = \pm 1$) the scatter diagram takes the general form of Fig. 6-2c.

* It is important to note that two random variables may be dependent but not linearly correlated (see, for example, Prob. 6-16).

In problems of engineering statistics, frequently several pairs of values of two related discrete random variables are known. The problem of interest is to find an estimate of this linear relationship such that we have the best straight-line fitting. The word "best" is taken in the sense of least mean square; that is, if the desired line has a cartesian equation of the type

$$Y = A + BX$$

then A and B must be such that the sum

$$g(A,B) = E(Y - A - BX)^2 \tag{6-42}$$

has its smallest possible value. This condition requires that the first partial derivatives of $g(A,B)$ be simultaneously zero.

$$\frac{\partial g}{\partial A} = -2E(Y - A - BX) = 0 \qquad \frac{\partial g}{\partial B} = -2E(XY - AX - BX^2) = 0 \tag{6-43}$$

or

$$\bar{Y} = A + B\bar{X} \qquad \overline{XY} = A\bar{X} + B\,\overline{X^2} \tag{6-44}$$

These equations give

$$Y = \bar{Y} + \frac{\mu_{11}}{\sigma_x^2}\,(X - \bar{X}) \tag{6-45}$$

This is the equation of the best-fitting straight line in the least-square sense. In the statistical literature this line is referred to as the regression line. Note that the regression line goes through the point $(\bar{X}, \bar{Y})$.
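The regression line and the correlation coefficient follow directly from the second-order central moments. The sketch below is a minimal illustration, assuming the paired values are equally likely samples; the function name `regression_line` and the sample data are hypothetical.

```python
import math

def regression_line(xs, ys):
    """Return (A, B, rho) for the least-mean-square line Y = A + B*X.

    Uses B = mu11 / mu20 and A = Ybar - B * Xbar, with the moments
    estimated from equally weighted sample pairs.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    mu20 = sum((x - x_bar) ** 2 for x in xs) / n
    mu02 = sum((y - y_bar) ** 2 for y in ys) / n
    mu11 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    slope = mu11 / mu20
    intercept = y_bar - slope * x_bar
    rho = mu11 / math.sqrt(mu20 * mu02)
    return intercept, slope, rho

# Points lying exactly on Y = 1 + 2X give rho = 1.
print(regression_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))
```

Because the intercept is computed as $\bar{Y} - B\bar{X}$, the fitted line passes through $(\bar{X}, \bar{Y})$ by construction.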

6-7. Linear Combination of Random Variables. Let X be the linear combination of a finite number of random variables.

$$X = a_1X_1 + a_2X_2 + \cdots + a_nX_n$$

We assume that the coefficients $a_1, a_2, \ldots, a_n$ have fixed values. The mathematical expectation of X is

$$E(X) = E(a_1X_1 + a_2X_2 + \cdots + a_nX_n)$$

$$\bar{X} = a_1\bar{X}_1 + a_2\bar{X}_2 + \cdots + a_n\bar{X}_n \tag{6-46}$$

The variance of X is

$$\sigma_x^2 = E[(X - \bar{X})^2] = E[a_1(X_1 - \bar{X}_1) + a_2(X_2 - \bar{X}_2) + \cdots + a_n(X_n - \bar{X}_n)]^2 \tag{6-47}$$

Note that Eqs. (6-46) and (6-47) are valid for dependent as well as independent variables. When all the variables are independent (more generally, whenever they are pairwise uncorrelated, so that every cross term vanishes), Eq. (6-47) leads to the following simple expression:

$$\sigma_x^2 = a_1^2\sigma_1^2 + a_2^2\sigma_2^2 + \cdots + a_n^2\sigma_n^2 \tag{6-48}$$

For example, consider X to be the number of heads in a sequence of N throws of a biased coin with the probability of a head in each throw being equal to p. Then

$$X = X_1 + X_2 + \cdots + X_N$$

where $X_i = 1$ if the ith throw results in a head and $X_i = 0$ otherwise. Now

$$P\{X_i = 1\} = p$$
$$P\{X_i = 0\} = 1 - p$$
$$E(X_i) = p$$
$$\sigma_{x_i}^2 = E(X_i^2) - [E(X_i)]^2 = p(1 - p)$$

As the $X_i$ are independent, we have, by application of Eqs. (6-46) and (6-48),

$$E(X) = E(X_1) + E(X_2) + \cdots + E(X_N) = Np$$

$$\sigma_x^2 = \sigma_{x_1}^2 + \sigma_{x_2}^2 + \cdots + \sigma_{x_N}^2 = Np(1 - p)$$

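Example 6-7. Let

$$Y = \frac{X - m}{\sigma}$$

Find the mean and the standard deviation of Y when $\bar{X} = m$ and $\sigma_x = \sigma$.

Solution

$$\bar{Y} = \frac{1}{\sigma}(\bar{X} - m) = \frac{1}{\sigma}(m - m) = 0$$

$$\sigma_y^2 = \frac{1}{\sigma^2}\,\sigma_x^2 + \text{variance of } \frac{m}{\sigma} = 1 + 0 = 1$$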





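Solution

$$\bar{X} = \int_0^1\!\!\int_0^1 x(x + y)\,dx\,dy = \tfrac{7}{12}$$

$$\bar{Y} = \int_0^1\!\!\int_0^1 y(x + y)\,dx\,dy = \tfrac{7}{12}$$

$$\overline{X^2} = \int_0^1\!\!\int_0^1 x^2(x + y)\,dx\,dy = \tfrac{5}{12}$$

$$\overline{Y^2} = \tfrac{5}{12}$$

$$\sigma_x^2 = \sigma_y^2 = \tfrac{5}{12} - \left(\tfrac{7}{12}\right)^2 = \tfrac{11}{144}$$

$$\mu_{11} = E[(X - \bar{X})(Y - \bar{Y})] = E(XY) - \bar{X}\bar{Y} = \tfrac{1}{3} - \tfrac{49}{144} = -\tfrac{1}{144}$$

$$\rho = \frac{\mu_{11}}{\sigma_x\sigma_y} = \frac{-\tfrac{1}{144}}{\tfrac{11}{144}} = -\frac{1}{11}$$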
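The fractions in this solution can be reproduced exactly. The sketch below is an illustration under an assumption: the integrands above are read as implying the joint density $f(x,y) = x + y$ on the unit square, for which $E(X^iY^j) = \frac{1}{(i+2)(j+1)} + \frac{1}{(i+1)(j+2)}$. The helper name `alpha` is hypothetical.

```python
from fractions import Fraction

def alpha(i, j):
    """Moment E(X^i Y^j), assuming the density f(x, y) = x + y on the unit square."""
    return Fraction(1, (i + 2) * (j + 1)) + Fraction(1, (i + 1) * (j + 2))

x_bar = alpha(1, 0)
y_bar = alpha(0, 1)
var_x = alpha(2, 0) - x_bar ** 2
var_y = alpha(0, 2) - y_bar ** 2
mu11 = alpha(1, 1) - x_bar * y_bar
# sigma_x equals sigma_y here, so sigma_x * sigma_y is simply var_x.
rho = mu11 / var_x

print(x_bar, var_x, mu11, rho)
```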

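6-8. Moments of Some Common Distribution Functions. In this section we should like to apply some of the material of the preceding sections to some familiar distributions.

Binomial Distribution. The random variable X assumes the values

$$x_k = k \qquad k = 0, 1, 2, \ldots, n$$

with the probability

$$P_k = \binom{n}{k} p^k (1 - p)^{n-k}$$

The first and second moments are

$$m_1 = \sum_{k=0}^{n} k\binom{n}{k} p^k (1 - p)^{n-k} = \sum_{k=0}^{n} k\,\frac{n!}{k!\,(n - k)!}\, p^k (1 - p)^{n-k} \tag{6-49}$$

$$m_1 = np\,[p + (1 - p)]^{n-1} = np$$

$$m_2 = \sum_{k=0}^{n} k^2 \binom{n}{k} p^k (1 - p)^{n-k} \tag{6-50}$$

$$m_2 = np \sum_{k=1}^{n} k \binom{n-1}{k-1} p^{k-1} (1 - p)^{n-k}$$

$$m_2 = np \left[\sum_{k=1}^{n} (k - 1)\binom{n-1}{k-1} p^{k-1}(1 - p)^{n-k} + \sum_{k=1}^{n} \binom{n-1}{k-1} p^{k-1}(1 - p)^{n-k}\right]$$

$$m_2 = np[(n - 1)p + 1]$$

$$\mu_2 = m_2 - m_1^2 = np[(n - 1)p + 1] - (np)^2 = np(1 - p) \tag{6-51}$$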
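The closed forms $m_1 = np$ and $\mu_2 = np(1 - p)$ can be confirmed by summing the defining series term by term. The sketch below is an illustrative check; the function name `binomial_moments` and the sample parameters are hypothetical.

```python
import math

def binomial_moments(n, p):
    """Return (m1, m2) for the binomial distribution by direct summation."""
    probs = [math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    m1 = sum(k * q for k, q in enumerate(probs))
    m2 = sum(k * k * q for k, q in enumerate(probs))
    return m1, m2

n, p = 10, 0.3
m1, m2 = binomial_moments(n, p)
print(m1, n * p)                       # first moment against np
print(m2 - m1 ** 2, n * p * (1 - p))   # variance against np(1 - p)
```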





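Poisson's Distribution. The random variable X assumes the values

$$x_n = n \qquad n = 0, 1, 2, \ldots$$

with the probability

$$P_n = e^{-\lambda}\,\frac{\lambda^n}{n!}$$

$$m_1 = E(X) = e^{-\lambda}\left[0 + \lambda + \lambda^2 + \frac{\lambda^3}{2!} + \cdots + \frac{\lambda^n}{(n - 1)!} + \cdots\right]$$

$$m_1 = e^{-\lambda}\cdot\lambda e^{\lambda} = \lambda \tag{6-52}$$

$$m_2 = E(X^2) = e^{-\lambda}\left(0 + 1\cdot\lambda + 2^2\,\frac{\lambda^2}{2!} + \cdots + n^2\,\frac{\lambda^n}{n!} + \cdots\right)$$

In order to compute $m_2$ in a closed form, note that

$$\frac{n^2}{n!} = \frac{n(n - 1) + n}{n!} = \frac{1}{(n - 1)!} + \frac{1}{(n - 2)!}$$

Thus

$$m_2 = e^{-\lambda}(\lambda e^{\lambda} + \lambda^2 e^{\lambda}) = \lambda + \lambda^2$$

The variance and standard deviation are, respectively,

$$\mu_2 = m_2 - m_1^2 = \lambda \tag{6-54}$$

$$\sigma = \sqrt{\lambda} \tag{6-55}$$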
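The series for the Poisson moments converge quickly, so a truncated sum gives a direct check of $m_1 = \lambda$ and $m_2 = \lambda + \lambda^2$. The sketch below is illustrative only; the function name `poisson_moments` and the truncation length are assumptions.

```python
import math

def poisson_moments(lam, terms=200):
    """Return (m1, m2) of a Poisson variable from a truncated series."""
    prob = math.exp(-lam)        # P_0
    m1 = 0.0
    m2 = 0.0
    for n in range(1, terms):
        prob *= lam / n          # P_n obtained from P_(n-1)
        m1 += n * prob
        m2 += n * n * prob
    return m1, m2

lam = 3.0
m1, m2 = poisson_moments(lam)
print(m1, m2, m2 - m1 ** 2)      # approximately lam, lam + lam^2, lam
```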


Normal Distribution. The random variable X has a density distribution function

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^2/2\sigma^2}$$

$$m_1 = E(X) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} x\,e^{-x^2/2\sigma^2}\,dx = 0 \tag{6-56}$$

$$m_2 = E(X^2) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} x^2 e^{-x^2/2\sigma^2}\,dx = \sigma^2 \tag{6-57}$$

$$\sqrt{m_2} = \sigma$$

When the distribution is centered at a point with abscissa m, the mean value of the variable is m and its standard deviation is $\sigma$. A normal distribution is completely determined by its two parameters m and $\sigma$:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - m)^2/2\sigma^2}$$



Cauchy's Distribution. Cauchy's distribution provides an example where a moment is not defined; in fact,

$$m_1 = \frac{a}{\pi} \int_{-\infty}^{\infty} \frac{x\,dx}{x^2 + a^2}$$

This integral is not convergent; hence the first moment is not defined.

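Bivariate Normal Distribution. The normal bivariate density function was discussed in Sec. 5-11. For convenience of interpretation, here we shall give the normal bivariate in terms of its statistical parameters.

$$f(x,y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}} \exp\left\{-\frac{1}{2(1 - \rho^2)}\left[\left(\frac{x - m_1}{\sigma_1}\right)^2 - 2\rho\,\frac{(x - m_1)(y - m_2)}{\sigma_1\sigma_2} + \left(\frac{y - m_2}{\sigma_2}\right)^2\right]\right\}$$

In this equation the parameters $m_1$, $m_2$, $\sigma_1$, $\sigma_2$, and $\rho$ have direct statistical interpretations; for example,

$$\alpha_{10} = \bar{X} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} x f(x,y)\,dx\,dy = m_1$$

$$\alpha_{01} = \bar{Y} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} y f(x,y)\,dx\,dy = m_2$$

$$\alpha_{20} = \overline{X^2} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} x^2 f(x,y)\,dx\,dy = \sigma_1^2 + m_1^2$$

$$\alpha_{02} = \overline{Y^2} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} y^2 f(x,y)\,dx\,dy = \sigma_2^2 + m_2^2 \tag{6-58}$$

$$\mu_{20} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} (x - m_1)^2 f(x,y)\,dx\,dy = \sigma_1^2$$

$$\mu_{02} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} (y - m_2)^2 f(x,y)\,dx\,dy = \sigma_2^2$$

$$\mu_{11} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} (x - m_1)(y - m_2) f(x,y)\,dx\,dy = \rho\sigma_1\sigma_2$$

These relations can be verified by direct computation. $m_1$ and $m_2$ are average values of X and Y; $\sigma_1$ and $\sigma_2$, their respective standard deviations; and $\rho$, the correlation coefficient.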

Binomial Distribution in Two Dimensions. Consider an experiment involving the joint occurrence of two specific events $E_1$ and $E_2$.

$$P\{E_1\} = p_1$$
$$P\{E_2\} = p_2$$
$$P\{E_1E_2\} = 0$$

If the experiment is repeated k times, the corresponding probability distribution, that is, the probability of having x times event $E_1$ and y times event $E_2$, is

$$f(x,y) = \frac{k!}{x!\,y!\,(k - x - y)!}\; p_1^x\, p_2^y\, (1 - p_1 - p_2)^{k - x - y}$$

The different moments and correlation coefficients are found to be*

$$\mu_{11} = -kp_1p_2 \qquad \mu_{20} = kp_1(1 - p_1) \qquad \mu_{02} = kp_2(1 - p_2)$$

$$\rho = \frac{\mu_{11}}{\sqrt{\mu_{20}\,\mu_{02}}} = -\sqrt{\frac{p_1p_2}{(1 - p_1)(1 - p_2)}}$$

Poisson's Distribution in Two Dimensions. Using the same notation as in the case of binomial distribution in two dimensions, let $p_1$ and $p_2$ tend to zero and k tend to infinity while

$$kp_1 \to a$$
$$kp_2 \to b$$

The two-dimensional Poisson distribution is the limit of the two-dimensional binomial distribution.

$$f(x,y) = e^{-(a + b)}\,\frac{a^x\, b^y}{x!\,y!} \tag{6-61}$$

Example 6-9. Let $\theta$ be a random variable uniformly distributed in the interval $-(\pi/2)$ to $+(\pi/2)$. Find the first and the second moment of the function

$$X = g(\theta) = A \sin\theta$$

Solution. The first step is to find the density function for the variable X. According to Eqs. (5-73),

$$f(x) = \frac{1}{\pi}\left|\frac{d\theta}{dx}\right| = \frac{1}{\pi\sqrt{A^2 - x^2}} \qquad -A < x < A$$

This density function is shown in Fig. E6-9.

* See, for instance, A. Guldberg, Sur les lois de probabilités et la corrélation, Ann. Inst. Henri Poincaré, fascicule II, vol. 5, pp. 159-176, 1935.





The moments of first and second orders are

$$m_1 = E(X) = \int_{-A}^{A} \frac{x\,dx}{\pi\sqrt{A^2 - x^2}} = 0$$

$$m_2 = E(X^2) = \int_{-A}^{A} \frac{x^2\,dx}{\pi\sqrt{A^2 - x^2}} = \frac{A^2}{2}$$

6-9. Characteristic Function of a Random Variable. The characteristic function of a random variable X is defined as the mathematical expectation of $e^{jtX}$, where t is a real variable, e the base of the natural logarithm, and $j = \sqrt{-1}$. When the distribution of the random variable is absolutely continuous, one has

$$\phi_x(t) = \int_{-\infty}^{\infty} e^{jtx} f(x)\,dx \tag{6-62}$$

$f(x)$ being the probability density function of the variable X. When the random variable X is of the discrete type, the characteristic function is defined by

$$\phi_x(t) = \sum_k p_k\, e^{jtx_k} \tag{6-62a}$$

The characteristic function is always a well-defined function, that is, the integral of Eq. (6-62) always converges. Moreover, it can be proved that the characteristic function uniquely determines a distribution function, in the discrete, the continuous, and other possible cases. The proof is not given here. However, the reader can see that $\phi_x(t)$ is the inverse Fourier transform of $f(x)$, that is, the two functions are interrelated by Fourier integrals.

$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-jtx}\,\phi_x(t)\,dt \tag{6-63}$$

The following properties of the characteristic function are of immediate interest.

(I) $\phi_x(0) = 1$

(II) $|\phi_x(t)| \le 1$ (6-64)

(III) $\phi_x(t)$ is a continuous function

Property I is self-evident from Eq. (6-62). In order to show the validity of property II, note that

$$\left|\int_{-\infty}^{\infty} e^{jtx} f(x)\,dx\right| \le \int_{-\infty}^{\infty} f(x)\,dx = 1 \tag{6-65}$$




While the inequality (6-65) is self-evident, one may alternatively use the following novel physical reasoning.

Consider a unit circle in a complex plane. We select a number of points on the unit circle with coordinates

$$\cos tx_k + j \sin tx_k \qquad k = 1, 2, \ldots$$

To each point we assign a mass $p_k$ equal to the probability associated with the value $x_k$ of our discrete random variable X,

$$x_1, x_2, \ldots, x_k, \ldots$$
$$p_1, p_2, \ldots, p_k, \ldots$$

Obviously, the center of gravity of these masses located on a convex curve (unit circle) must remain within the convex curve for all values of the real parameter t. But this center of gravity coincides with the point $\phi_x(t)$; thus

$$\phi_x(t) = E(e^{jtX}) = E(\cos tX + j \sin tX) \tag{6-66}$$

$$\phi_x(t) = \sum_{k=1}^{\infty} p_k(\cos tx_k + j \sin tx_k) = \sum_{k=1}^{\infty} p_k \cos tx_k + j \sum_{k=1}^{\infty} p_k \sin tx_k \tag{6-67}$$

$$|\phi_x(t)| \le 1$$

Property III requires more detailed mathematical consideration and so is omitted here.
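Properties I and II are easy to observe for a discrete random variable. The sketch below is an illustration only: it evaluates the defining sum of Eq. (6-62a) for a fair die, a hypothetical example not taken from the text, and the function name `char_fn` is likewise an assumption.

```python
import cmath

def char_fn(values, probs, t):
    """Characteristic function of a discrete variable: sum of p_k * exp(j*t*x_k)."""
    return sum(p * cmath.exp(1j * t * x) for x, p in zip(values, probs))

faces = [1, 2, 3, 4, 5, 6]
probs = [1.0 / 6.0] * 6

for t in (0.0, 0.5, 2.0):
    phi = char_fn(faces, probs, t)
    # The modulus equals 1 at t = 0 and never exceeds 1 elsewhere.
    print(t, abs(phi))
```

Each value of the sum is the center of gravity of masses $p_k$ placed on the unit circle, which is why its modulus cannot exceed unity.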

6-10. Characteristic Function and Moment-generating Function of Random Variables. For two independent random variables X, Y and their sum $Z = X + Y$ the following relation is immediately evident:

$$E(e^{jtZ}) = E(e^{jt(X + Y)}) = E(e^{jtX})\,E(e^{jtY}) \tag{6-68}$$

$$\phi_z(t) = \phi_x(t)\,\phi_y(t) \tag{6-69}$$

By induction one concludes that the characteristic function of the sum of a number of independent random variables is the product of their characteristic functions.

Next let us compute the characteristic function of a random variable Y which is a linear function of another random variable X:

$$Y = aX + b \qquad a, b \text{ real numbers} \tag{6-70}$$

$$\phi_y(t) = E(e^{jtY}) = E(e^{jt(aX + b)})$$

Finally

$$\phi_y(t) = e^{jtb}\,\phi_x(at) \tag{6-71}$$

The characteristic functions of some of the common distribution functions will now be derived.





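Binomial. The characteristic function of the binomial distribution is

$$\phi_x(t) = \sum_{k=0}^{n} \binom{n}{k} p^k (1 - p)^{n-k}\, e^{jtk} \tag{6-72}$$

This obviously is the binomial expansion of

$$[(pe^{jt}) + (1 - p)]^n = \phi_x(t) \tag{6-73}$$

Poisson. The characteristic function of the Poisson distribution is

$$\phi_x(t) = \sum_{k=0}^{\infty} e^{-\lambda}\,\frac{\lambda^k}{k!}\, e^{jtk} \tag{6-74}$$

$$\phi_x(t) = e^{\lambda(e^{jt} - 1)} \tag{6-75}$$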
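The closed forms for the binomial and Poisson characteristic functions can be compared with their defining sums. The sketch below is a hedged numerical illustration; all function names are hypothetical, and the Poisson series is truncated at an assumed number of terms.

```python
import cmath
import math

def binomial_cf_sum(n, p, t):
    """Defining sum of the binomial characteristic function."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) * cmath.exp(1j * t * k)
               for k in range(n + 1))

def binomial_cf_closed(n, p, t):
    """Closed form (p * e^{jt} + 1 - p)^n."""
    return (p * cmath.exp(1j * t) + (1 - p)) ** n

def poisson_cf_sum(lam, t, terms=100):
    """Truncated defining sum of the Poisson characteristic function."""
    weight = math.exp(-lam)              # P_0
    total = complex(weight)
    for k in range(1, terms):
        weight *= lam / k                # P_k obtained from P_(k-1)
        total += weight * cmath.exp(1j * t * k)
    return total

def poisson_cf_closed(lam, t):
    """Closed form exp(lam * (e^{jt} - 1))."""
    return cmath.exp(lam * (cmath.exp(1j * t) - 1))

print(abs(binomial_cf_sum(8, 0.3, 1.7) - binomial_cf_closed(8, 0.3, 1.7)))
print(abs(poisson_cf_sum(2.5, 0.9) - poisson_cf_closed(2.5, 0.9)))
```

Both printed differences should be at the level of floating-point rounding.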

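Normal. The characteristic function of the standardized normal distribution is

$$\phi_x(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{jtx}\, e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(x^2 - 2jtx - t^2 + t^2)}\,dx$$

$$\phi_x(t) = e^{-t^2/2}\,\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left[-\frac{(x - jt)^2}{2}\right] dx \tag{6-76}$$

$$\phi_x(t) = e^{-t^2/2} \tag{6-77}$$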


Example 6-10. Find the characteristic function of a standardized random variable

$$Y = \frac{X - m}{\sigma}$$

where $m = \bar{X}$ and $\sigma = \sigma_x$.

Solution. Applying Eq. (6-71), one finds

$$\phi_y(t) = e^{-jtm/\sigma}\,\phi_x\!\left(\frac{t}{\sigma}\right)$$

The different-order moments of a random variable can be obtained directly from the expansion of its characteristic function.

$$e^{jtx} = 1 + \frac{jtx}{1!} + \frac{(jtx)^2}{2!} + \cdots \tag{6-78}$$

Applying this equation to (6-62), one obtains

$$\phi_x(t) = \int_{-\infty}^{\infty} f(x)\,dx + \frac{jt}{1!} \int_{-\infty}^{\infty} x f(x)\,dx + \frac{(jt)^2}{2!} \int_{-\infty}^{\infty} x^2 f(x)\,dx + \cdots \tag{6-79}$$

While the characteristic function $\phi_x(t)$ always exists, the above expansion is not always possible. Also it is to be noted that, even if all the moments exist, the above expansion is valid only in its region of convergence. Subject to these restrictions, the rth-order moment is

$$m_r = j^{-r}\left[\frac{d^r\phi_x(t)}{dt^r}\right]_{t=0} \tag{6-80}$$

The expansion of the characteristic function gives

$$\phi_x(t) = \phi_x(0) + \frac{t}{1!}\,\phi_x'(0) + \frac{t^2}{2!}\,\phi_x''(0) + \cdots + \frac{t^r}{r!}\,\phi_x^{(r)}(0) + \cdots$$

Moment-generating Functions. While the different moments have been derived directly from the characteristic function, they could alternatively be obtained from the moment-generating function. The latter function is defined for a discrete and a continuous random variable, respectively, as

$$\psi_x(t) = E(e^{tX}) = \sum_i P_i\, e^{tx_i} \tag{6-81}$$

$$\psi_x(t) = E(e^{tX}) = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx \tag{6-81a}$$

Note that $\psi_x(jt) = \phi_x(t)$.

In a manner similar to the derivation of Eq. (6-80), one can see that

$$m_r = \left[\frac{d^r\psi_x(t)}{dt^r}\right]_{t=0} \qquad \text{since} \qquad \frac{d^r\psi_x(t)}{dt^r} = E(X^r e^{tX}) \tag{6-82}$$

The defining equation of the characteristic function can be directly extended to cover multivariate distributions. For instance, for a bivariate distribution we have

$$\phi(u,v) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} e^{j(ux + vy)} f(x,y)\,dx\,dy \tag{6-83}$$

The power expansion of this function will lead to the calculation of 
different-order moments. For example, 





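Example 6-11. Find the first and second moments of a Poisson distribution.

Solution

$$\psi_x(t) = E(e^{tX}) = \sum_{x=0}^{\infty} e^{tx}\, e^{-\lambda}\,\frac{\lambda^x}{x!} = e^{\lambda(e^t - 1)} \qquad x = 0, 1, 2, \ldots$$

$$\frac{d\psi_x(t)}{dt} = \lambda e^t\,\psi_x(t) \qquad \frac{d^2\psi_x(t)}{dt^2} = \lambda e^t(1 + \lambda e^t)\,\psi_x(t)$$

$$m_1 = \left[\frac{d\psi_x(t)}{dt}\right]_{t=0} = \lambda$$

$$m_2 = \left[\frac{d^2\psi_x(t)}{dt^2}\right]_{t=0} = \lambda(1 + \lambda)$$

$$\mu_2 = \lambda(1 + \lambda) - \lambda^2 = \lambda$$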
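The derivative rule for the moment-generating function can be tried numerically. The sketch below is an illustration under stated assumptions: it approximates the first two derivatives at $t = 0$ by central differences with an assumed step size, and applies them to the Poisson moment-generating function $e^{\lambda(e^t - 1)}$. The function names are hypothetical.

```python
import math

def poisson_mgf(t, lam):
    """Moment-generating function of a Poisson variable with mean lam."""
    return math.exp(lam * (math.exp(t) - 1.0))

def moments_from_mgf(mgf, h=1e-3):
    """Approximate m1 and m2 as the first and second derivatives of mgf at t = 0."""
    m1 = (mgf(h) - mgf(-h)) / (2.0 * h)
    m2 = (mgf(h) - 2.0 * mgf(0.0) + mgf(-h)) / (h * h)
    return m1, m2

lam = 2.0
m1, m2 = moments_from_mgf(lambda t: poisson_mgf(t, lam))
print(m1, m2, m2 - m1 ** 2)   # approximately lam, lam*(1 + lam), lam
```

Finite differences carry a small truncation error, so the printed values agree with the closed forms only to several decimal places.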


6-11. Density Functions of the Sum of Two Random Variables. In a number of problems which occur in practice one is interested in finding the density function for the sum of two independent random variables X and Y when the individual density functions, say $f_1(x)$ and $f_2(y)$, are known. From the fact that the two variables are independent, we conclude that the density of the two-dimensional random variable (X,Y) is given by

$$f(x,y) = f_1(x)\,f_2(y) \tag{6-85}$$

Fig. 6-3. Determination of the probability density for the sum of two independent variables.

Our problem is first to find the probability distribution function for the variable $X + Y$, that is,

$$P\{X + Y < t\}$$

This is done by first drawing the line

$$x + y = t \tag{6-86}$$

in the (X,Y) plane with t as a parameter (Fig. 6-3). Now the probability under consideration can be obtained by integrating the density function [Eq. (6-85)] over the shaded region R.

$$P\{X + Y < t\} = \iint_R f(x,y)\,dx\,dy = \iint_R f_1(x)\,f_2(y)\,dx\,dy \tag{6-87}$$

$$P\{X + Y < t\} = \int_{-\infty}^{\infty} F_1(t - y)\,f_2(y)\,dy \tag{6-88}$$

where $F_1(x)$ is assumed to be the CDF of the variable X.




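The desired density function is obtained by taking the derivative of the integral with respect to t:

$$f(t) = \int_{-\infty}^{\infty} f_1(t - y)\,f_2(y)\,dy \tag{6-89}$$

An alternative method of obtaining the density function for the sum of two independent random variables involves the use of the concept of the characteristic function and its relation with the Fourier transform. In fact,

$$\text{Density for } X = f_1(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-jtx}\,\phi_x(t)\,dt \tag{6-90}$$

$$\text{Density for } Y = f_2(y) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-jty}\,\phi_y(t)\,dt \tag{6-91}$$

$$\text{Density for } Z = f(z) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-jtz}\,\phi_z(t)\,dt = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-jtz}\,\phi_x(t)\,\phi_y(t)\,dt \tag{6-92}$$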

It is to be noted that the relationship between a density function and its associated characteristic function is that of the familiar Fourier integral. Having this in mind, we recall the convolution theorem of the Fourier (or Laplace) transforms. According to the theorem for Fourier transforms, the inverse Fourier (or Laplace) transform of the product of two functions corresponds to the convolution integral of the two inverse functions. That is, if

$$f_1(x) = \mathcal{F}^{-1}[\phi_x(t)] \qquad f_2(y) = \mathcal{F}^{-1}[\phi_y(t)] \tag{6-93}$$

then

$$f(z) = \mathcal{F}^{-1}[\phi_x(t)\,\phi_y(t)] = f_1(x) * f_2(y) \tag{6-94}$$

where

$$f_1(x) * f_2(y) = \int_{-\infty}^{\infty} f_1(z - y)\,f_2(y)\,dy \tag{6-95}$$

The result of Eq. (6-95) can be generalized in order to obtain the probability density function of the sum of a finite number of independent random variables. Let

$$X = \sum_{k=1}^{n} X_k$$

where the $X_k$ are all independent. The probability density of X is

$$f(x) = f_1(x_1) * f_2(x_2) * \cdots * f_n(x_n) \tag{6-96}$$

If every variable is normally distributed, then their sum will also be normally distributed. The same is true when all variables have distributions of the Poisson type or of the binomial type with a common parameter p (see Probs. 6-14 and 6-7).
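For discrete random variables the convolution integral becomes a convolution sum, which makes the closure property of the Poisson family easy to verify. The sketch below is a discrete analog offered as an illustration, not the continuous formula of the text; the function names and the truncation size are assumptions.

```python
import math

def convolve(p, q):
    """Probability function of the sum of two independent discrete variables."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def poisson_pmf(lam, size):
    """First `size` Poisson probabilities P_0, ..., P_(size-1)."""
    return [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(size)]

size = 40
summed = convolve(poisson_pmf(1.0, size), poisson_pmf(2.0, size))
direct = poisson_pmf(3.0, size)

# The first `size` entries of the convolution are exact and should match Poisson(3).
print(max(abs(summed[k] - direct[k]) for k in range(size)))
```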





Example 6-12. X and Y are two independent random variables with standard normal distributions. Derive the distribution of their sum.

Solution. According to Eq. (6-94), the probability density of $Z = X + Y$ is

$$f(z) = f_1(x) * f_2(y) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp\left[-\frac{(z - x)^2}{2}\right] \exp\left(-\frac{x^2}{2}\right) dx$$

Make the change of variable:

$$x - \frac{z}{2} = \frac{v}{\sqrt{2}}$$

Thus

$$f(z) = \frac{1}{2\pi\sqrt{2}}\; e^{-z^2/4} \int_{-\infty}^{\infty} e^{-v^2/2}\,dv = \frac{1}{2\sqrt{\pi}}\; e^{-z^2/4}$$

The density distribution of Z is also a normal distribution with zero mean and a standard deviation of $\sqrt{2}$. Note that the mean and standard deviation could have been predicted by applying Eq. (6-23a).

$$E(Z) = E(X) + E(Y) = 0$$
$$E(Z^2) = E(X^2) + E(Y^2) = 2$$

Example 6-13. A gun fires at a target point which is assumed to be the center of the rectangular coordinate system. Let (x,y) be the coordinate of the point at which the bullet hits, and assume that the associated random variables X and Y are independent. We furthermore assume that X and Y are normally distributed with 0 mean and standard deviation 1. Find the probability distribution of the random variable

$$R^2 = X^2 + Y^2$$

Solution. It was shown in Example 5-11 that the density of $X^2$ is

$$\frac{1}{\sqrt{2\pi z}}\; e^{-z/2} \qquad z > 0$$

Application of the rule of convolution yields

$$f(r^2) = \int_0^{r^2} \frac{e^{-z/2}}{\sqrt{2\pi z}} \cdot \frac{e^{-(r^2 - z)/2}}{\sqrt{2\pi(r^2 - z)}}\,dz = \frac{e^{-r^2/2}}{2\pi} \int_0^{r^2} \frac{dz}{\sqrt{z(r^2 - z)}}$$

From an integral table one finds

$$\int_0^{r^2} \frac{dz}{\sqrt{z(r^2 - z)}} = \left[\sin^{-1}\frac{2z - r^2}{r^2}\right]_0^{r^2} = \pi$$

Finally we find

$$f(r^2) = \tfrac{1}{2}\, e^{-r^2/2} \qquad r^2 > 0$$





Example 6-14. The random variables X and Y are uniformly distributed over the real interval zero to a (a > 0). Find the density distribution of

(a) $Z = X + Y$

(b) $Z = X - Y$

Solution

(a) Since it is likely that most readers are more familiar with the ordinary (one-sided) Laplace transform than with the Fourier transform, in this example we shall change the parameter t to s, which is the symbol commonly used in engineering texts in conjunction with Laplace transforms. We also assume that the readers are familiar with the concept of a singularity function, such as is exemplified by the unit step $u_{-1}(t)$ and unit ramp $u_{-2}(t)$. Based on this notation,

$$f_1(x) = \frac{1}{a}\,u_{-1}(x) - \frac{1}{a}\,u_{-1}(x - a) \qquad \text{density of } X$$

$$f_2(y) = \frac{1}{a}\,u_{-1}(y) - \frac{1}{a}\,u_{-1}(y - a) \qquad \text{density of } Y$$

$$\psi_x(s) = \mathcal{L}\left[\frac{1}{a}\,u_{-1}(x)\right] - \mathcal{L}\left[\frac{1}{a}\,u_{-1}(x - a)\right] = \frac{1}{as} - \frac{e^{-as}}{as}$$

$$\psi_y(s) = \frac{1}{as} - \frac{e^{-as}}{as}$$

Since the two variables are independent, Eq. (6-69) gives

$$\psi_z(s) = \psi_x(s)\,\psi_y(s) = \frac{1}{a^2s^2}\,(1 - e^{-as})^2$$

The desired density function is the inverse Laplace transform of $\psi_z(s)$; thus,

$$f(z) = \mathcal{L}^{-1}\left[\frac{1}{a^2s^2}\,(1 - 2e^{-as} + e^{-2as})\right] = \frac{1}{a^2}\,[u_{-2}(z) - 2u_{-2}(z - a) + u_{-2}(z - 2a)]$$

The graphical representation of $f(z)$ follows directly from the definition of the unit step and unit ramp. This is illustrated in Fig. E6-14.

(b)

$$\psi_z(s) = \psi_x(s)\,\psi_y(-s) = -\frac{1}{a^2s^2}\,(1 - e^{-as})(1 - e^{as}) = -\frac{1}{a^2s^2}\,(2 - e^{-as} - e^{as})$$

$$f(z) = -\frac{1}{a^2}\,[2u_{-2}(z) - u_{-2}(z - a) - u_{-2}(z + a)]$$





It should be noted that for the convenience of calculation we have interchanged the defining terms of the Laplace transform with its inverse. Also note that the density function is similar to that of the previous case but displaced by a distance $-a$ along the axis of the random variable.

The simplicity of computation here lies in the symbolism used for describing the rectangular-shape function in a closed form through the use of singularity functions. Such methods are commonly practiced in electrical engineering problems. The reader, however, is warned against applying such techniques to density functions that are defined for $-\infty$ to $+\infty$. The proper extension of the above technique to such a problem requires the use of two-sided Laplace (or Fourier) transforms.
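The ramp expressions of Example 6-14 translate directly into code. The sketch below is an illustrative rendering of the two triangular densities, assuming the unit ramp $u_{-2}(t)$ equals t for $t > 0$ and zero otherwise; the function names are hypothetical.

```python
def ramp(t):
    """Unit ramp u_{-2}(t): t for t > 0, zero otherwise."""
    return t if t > 0 else 0.0

def sum_density(z, a):
    """Density of Z = X + Y for independent X, Y uniform on (0, a)."""
    return (ramp(z) - 2.0 * ramp(z - a) + ramp(z - 2.0 * a)) / a ** 2

def difference_density(z, a):
    """Density of Z = X - Y; the same triangle displaced by -a."""
    return (ramp(z + a) - 2.0 * ramp(z) + ramp(z - a)) / a ** 2

a = 2.0
print(sum_density(a, a), difference_density(0.0, a))       # both peaks equal 1/a
print(sum_density(3.0 * a, a), difference_density(-2.0 * a, a))  # zero outside the support
```

The second function is obtained from the first by replacing z with z + a, which is the displacement noted at the end of the example.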


PROBLEMS 

6-1. X and Y are random variables uniformly distributed in the interval 0 to 1. Find

(а) P\X < K < + II 

(б) P{X < r < - 1) 

(r) P{\X - yi < II 

(d) P{\X - Y\ < M\ 

(e) PIX -hY < 1^1 
(/) PtXY < 

6-2. The random variable (X,Y) assumes only the three sets of values (0,0), (0,1), and (1,1), with equal probabilities. Find

IP(X - ii)(y - ?4)] 

6-3. Let X have a Poisson distribution, that is,

$$f(x) = \frac{a^x e^{-a}}{x!} \qquad x = 0, 1, 2, \ldots$$

Show that

$$E(X^2) = aE(X + 1)$$

6-4. Let X and Y be independent random variables each with uniform distribution between 0 and 1. Find

(a) $E(X + Y)$

(b) $E[(X + Y)^2]$

6-5. Let X have a probability density

/(I) = 

Find the expectation and variance of X. 

6-6. Let X and Y be normally distributed independent random variables, each with zero mean and unity variance. Find the expectation and variance of $\sqrt{X^2 + Y^2}$.

6-7. Show that the sum of two independent binomial variables with the parameters $(n_1, p)$ and $(n_2, p)$ is also a binomial variable.

6-8. Same question for the sum of two independent random variables having Poisson distributions with means $\lambda_1$ and $\lambda_2$.

6-9. Let S be a random electric voltage varying between 0 and 1 volt, with a uniform probability distribution. The signal S is perturbed by an additive independent random noise N having a uniform distribution between 0 and 2 volts.





(a) Determine $\bar{X}$.

(b) Determine the average power when the voltage X is applied to a resistor of 2 ohms.

6-10. A number X is chosen at random from the integers 1, 2, 3, . . . , n; find $\bar{X}$ and its standard deviation.

6-11. A die is loaded in such a way that the probability of getting x is proportional to x (x = 1, 2, 3, 4, 5, 6). Find the smallest number of throwings for which

7M|A^ - E{X)\ > ^.1} < 0.001 

6-12. Two points A and B are chosen at random on the circumference of a circle with center C and radius R. Let X be the area of the triangle ABC. Find $\bar{X}$.

6-13. Let X and Y be standardized normally distributed random variables. Find the probability density of X/Y.

6-14. The independent variables $X_k$ all have distributions of the Poisson type (k = 1, 2, . . . , n).

$$X = X_1 + X_2 + \cdots + X_n$$

(a) Find the characteristic function of $X_k$.

(b) Find the characteristic function of X.

(c) Determine the distribution of X.

6-15. The probability density distribution

f(x) = 0 elsewhere

is called a chi-square distribution. $\Gamma$ stands for the familiar gamma function, that is,

$$\Gamma(k) = \int_0^{\infty} x^{k-1} e^{-x}\,dx$$

(a) Find the moment-generating function for the chi-square distribution.

(b) Find the first and the second moment.

6-16. Let

$$V_1 = \sin 2\pi F \qquad V_2 = \cos 2\pi F$$

where F is a random variable uniformly distributed between 0 and 1. Show that the two random voltages $V_1$ and $V_2$ are dependent but not correlated.

6-17. The joint moment-generating function of two random variables X and Y is given:

= [a(e^+« + 1) + + e“)l' a > 0 ?> > 0 a + 6 = M 

Determine

E(X), E(Y), var X, var Y, and correlation coefficient

6-18. From the moment-generating functions described in this chapter, derive the standard deviation of binomial, Poisson, uniform, and normal distributions in one and two dimensions.

6-19. The random variable X is normally distributed with $(0, \sigma)$, and the random variable Y is distributed uniformly in the interval $[-\pi, +\pi]$. Find the probability density of

$$Z = X \sin Y$$

A solution in closed form may be found in Pugachev (Chap. 5).



CHAPTER 7 


NORMAL DISTRIBUTIONS AND LIMIT THEOREMS 


This chapter is primarily concerned with the probability distribution of functions of a large number of independent random variables. The main results of the chapter are contained in two basic theorems with frequent applications. These are the law of large numbers and the central-limit theorem. Prior to the development of these theorems, we present in some detail the multidimensional gaussian distribution. The latter distribution is of particular significance in dealing with systems of several independent random variables.

7-1. Bivariate Normal Considered as an Extension of One-dimensional Normal Distribution. In this section we study the normal bivariate distribution as an extension of the one-dimensional normal distribution:

$$f(x) = \frac{1}{\sigma_1\sqrt{2\pi}} \exp\left[-\frac{(x - a_1)^2}{2\sigma_1^2}\right] \tag{7-1}$$

This density function is a symmetric function about the mean value $a_1$. The exponent is negative for all values of the real variable x except for $x = a_1$. This can be alternatively expressed by referring to the mathematical term positive definite. We say that $(x - a_1)^2/2\sigma_1^2$ is a positive definite form in $x - a_1$. This implies that for all real values of the parameter $y = x - a_1$ the above form is positive except for $y = 0$.

Some familiarity with linear algebra should suggest that as a generalization of the above concept, for a bivariate, the exponent will be of the form

$$f(x_1,x_2) = C \exp(-\tfrac{1}{2}Q) \tag{7-2}$$

where Q is a positive definite quadratic form in variables $(x_1 - a_1)$ and $(x_2 - a_2)$, that is,

$$Q = A_{11}(x_1 - a_1)^2 + 2A_{12}(x_1 - a_1)(x_2 - a_2) + A_{22}(x_2 - a_2)^2 \tag{7-3}$$

The real coefficients $A_{ij}$ (i, j = 1, 2) should be such that Q remains positive for all real values of $x_1$ and $x_2$, except for $x_1 - a_1 = x_2 - a_2 = 0$. The expression in Eq. (7-3), which is a quadratic form in variables $y_1 = x_1 - a_1$ and $y_2 = x_2 - a_2$, can be written as



$$Q(y_1,y_2) = \sum_{i,j=1}^{2} A_{ij}\, y_i y_j \tag{7-4}$$

$$Q(y_1,y_2) = A_{11}y_1^2 + 2A_{12}y_1y_2 + A_{22}y_2^2 \tag{7-5}$$

The form shown in Eqs. (7-4) and (7-5) is the proper generalization of the 
one-variable form $Q(y) = Ay^2$. The reader who wishes to acquire 
introductory information about quadratic forms may refer to texts on linear 
algebra and matrices.* 

The coefficients can be conveniently arranged in a matrix form: 

$$[A] = \begin{bmatrix} A_{11} & A_{12} \\ A_{12} & A_{22} \end{bmatrix} \tag{7-6}$$


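The quadratic form associated with the real matrix can be written as 

$$Q(y_1,y_2) = [y_1, y_2] \begin{bmatrix} A_{11} & A_{12} \\ A_{12} & A_{22} \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = Y'AY \tag{7-7}$$

where 

$$Y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$$

and $Y'$ is the transposed $Y$ matrix. 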



Standard texts on matrices point out that the necessary and sufficient 
conditions for a quadratic form to be positive definite are the 
positiveness of all leading principal minors of the associated matrix $A$, 
that is, 

$$A_{11} > 0 \qquad A_{22} > 0 \qquad |A| = A_{11}A_{22} - A_{12}^2 > 0 \tag{7-8}$$


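Next, one should examine Eq. (7-2) to justify the requirement for a two-
dimensional density distribution. A comparison with Eq. (5-58) yields 

$$A_{11} = \frac{1}{\sigma_1^2(1-\rho^2)} \qquad A_{22} = \frac{1}{\sigma_2^2(1-\rho^2)} \qquad A_{12} = -\frac{\rho}{\sigma_1\sigma_2(1-\rho^2)} \tag{7-9}$$

$$f(x_1,x_2) = \frac{\sqrt{|A|}}{2\pi} \exp\left\{-\frac{1}{2}\left[A_{11}(x_1 - a_1)^2 + 2A_{12}(x_1 - a_1)(x_2 - a_2) + A_{22}(x_2 - a_2)^2\right]\right\} \tag{7-10}$$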


*See also F. M. Reza and Samuel Seely, "Modern Network Analysis," chap. 3, 
McGraw-Hill Book Company, Inc., New York, 1959. 

In the above form one identifies $1/[A_{11}(1 - \rho^2)]$ as the variance of $X_1$, 
$1/[A_{22}(1 - \rho^2)]$ as the variance of $X_2$, and $-\rho^2/[A_{12}(1 - \rho^2)]$ as the 
covariance between $X_1$ and $X_2$. The parameters $a_1$ and $a_2$ can be identified 
with the means of $X_1$ and $X_2$, respectively. The final result of this 
normalization procedure is 


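$$f(x_1,x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1 - a_1)^2}{\sigma_1^2} - \frac{2\rho(x_1 - a_1)(x_2 - a_2)}{\sigma_1\sigma_2} + \frac{(x_2 - a_2)^2}{\sigma_2^2}\right]\right\} \tag{7-11}$$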


- 'i- - - R| <■ - vra ‘’f ' 

This is the density function for the normal bivariate random variable in a 
normalized symmetric form exhibiting all its pertinent statistical 
parameters. The coefficient $\rho\sigma_1\sigma_2$ is the covariance of the two one-dimensional 
random variables $X_1$ and $X_2$, and $\rho$ their correlation coefficient. In a 
particular case when the two random variables $X_1$ and $X_2$ are statistically 
independent (for instance, when two independent effects of one 
experiment are considered), $\rho = 0$ and 

$$f(x_1,x_2) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left[-\frac{(x_1 - a_1)^2}{2\sigma_1^2}\right] \cdot \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\left[-\frac{(x_2 - a_2)^2}{2\sigma_2^2}\right] \tag{7-13}$$


This result is of some interest in our subsequent work. It states that, 
when the two sampling variables of a normal bivariate are mutually 
independent, the two-dimensional normal distribution reduces to the 
product of two distributions of single variables. 
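
The factorization for independent variables can be checked numerically. The following is a minimal Python sketch, not part of the original text. It assumes that the coefficients of $Q$ are $A_{11} = 1/[\sigma_1^2(1-\rho^2)]$, $A_{22} = 1/[\sigma_2^2(1-\rho^2)]$, $A_{12} = -\rho/[\sigma_1\sigma_2(1-\rho^2)]$, which is what the variance and covariance statements of this section imply, and that the constant is $C = \sqrt{|A|}/2\pi$. The function names and the sample point are illustrative choices.

```python
import math

def bivariate_normal_pdf(x1, x2, a1, a2, s1, s2, rho):
    """Evaluate C * exp(-Q/2) for the bivariate normal of Sec. 7-1."""
    k = 1.0 - rho * rho
    # Assumed identification of the coefficients of Q with sigma1, sigma2, rho.
    A11 = 1.0 / (s1 * s1 * k)
    A22 = 1.0 / (s2 * s2 * k)
    A12 = -rho / (s1 * s2 * k)
    det_A = A11 * A22 - A12 * A12
    # Positive-definiteness conditions of Eq. (7-8).
    if not (A11 > 0 and A22 > 0 and det_A > 0):
        raise ValueError("Q is not positive definite")
    y1, y2 = x1 - a1, x2 - a2
    Q = A11 * y1 * y1 + 2.0 * A12 * y1 * y2 + A22 * y2 * y2
    return math.sqrt(det_A) / (2.0 * math.pi) * math.exp(-0.5 * Q)

def normal_pdf(x, a, s):
    """One-dimensional normal density of Eq. (7-1)."""
    return math.exp(-0.5 * ((x - a) / s) ** 2) / (math.sqrt(2.0 * math.pi) * s)

# With rho = 0 the joint density should equal the product of the marginals.
joint = bivariate_normal_pdf(0.3, -1.2, 0.0, 1.0, 2.0, 0.5, 0.0)
product = normal_pdf(0.3, 0.0, 2.0) * normal_pdf(-1.2, 1.0, 0.5)
print(joint, product)
```

For a nonzero correlation coefficient the same function reproduces the normalized form of Eq. (7-11), since $\sqrt{|A|} = 1/(\sigma_1\sigma_2\sqrt{1-\rho^2})$ under the stated assumption.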

7-2. Multinormal Distribution. The procedure developed in the 
previous section can be directly generalized to the case of n-dimensional 
normal distributions. The n-dimensional normal density function is of 
the form 

$$f(x_1,x_2,\ldots,x_n) = C_n \exp\left(-\tfrac{1}{2}Q\right) \tag{7-14}$$

$C_n$ is an appropriate constant and $Q$ a quadratic polynomial in 

$$y_k = x_k - a_k$$

that is, 

$$Q(y_1,y_2,\ldots,y_n) = \sum_{i,j=1}^{n} A_{ij}(x_i - a_i)(x_j - a_j) = \sum_{i,j=1}^{n} A_{ij}\, y_i y_j \tag{7-15}$$


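The matrix $A$ being a real symmetric matrix, 

$$[A] = \begin{bmatrix} A_{11} & A_{12} & A_{13} & \cdots & A_{1n} \\ A_{21} & A_{22} & A_{23} & \cdots & A_{2n} \\ \vdots & \vdots & \vdots & & \vdots \\ A_{n1} & A_{n2} & A_{n3} & \cdots & A_{nn} \end{bmatrix} \tag{7-16}$$

the quadratic form can be written as 

$$Q(y_1,\ldots,y_n) = [y_1, y_2, \ldots, y_n]\,[A] \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \tag{7-17}$$

$$Q = Y'AY \tag{7-18}$$

where 

$$Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \tag{7-19}$$

and $Y'$ is its transpose. 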

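As a generalization of the concept of Eqs. (7-1) and (7-10), $Q$ must 
remain positive for all nonzero real values of $y_1, y_2, \ldots, y_n$. This 
requirement is satisfied if, and only if, $A$ is a positive definite matrix, 
that is, all its leading principal minors are positive: 

$$A_{11} > 0 \qquad \begin{vmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{vmatrix} > 0 \qquad \begin{vmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{vmatrix} > 0 \qquad \cdots \qquad \text{Determinant of } [A] > 0 \tag{7-20}$$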

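In order that Eq. (7-14) represent a normal distribution, we must have 

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1,\ldots,x_n)\, dx_1 \cdots dx_n = 1 \tag{7-21}$$

$$C_n \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp\left[-\tfrac{1}{2}Q(y_1,\ldots,y_n)\right] dy_1\, dy_2 \cdots dy_n = 1 \tag{7-22}$$

The value of $C_n$ can be determined from Eq. (7-22) for any given matrix 
$[A]$ and set of real numbers $a_1, a_2, \ldots, a_n$. The detailed calculation of 
$C_n$ requires space which is not presently available. The interested reader 
is referred to Cramér, Laning and Battin, or Wilks. It can be shown 
that the constant $C_n$ has the value 

$$C_n = \frac{\sqrt{|A|}}{(2\pi)^{n/2}} \tag{7-23}$$

so that 

$$\frac{\sqrt{|A|}}{(2\pi)^{n/2}} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp\left[-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} A_{ij}(x_i - a_i)(x_j - a_j)\right] dx_1 \cdots dx_n = 1 \tag{7-24}$$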





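The average value of the variable $X_k$ is $a_k$. The variance of $X_k$ and the 
covariance of $X_i$ and $X_j$ are found to be (Wilks) 

$$\sigma_k^2 = E[(X_k - a_k)^2] = \frac{\text{cofactor of } A_{kk}}{|A|} \tag{7-25}$$

$$\text{Covariance of } X_i \text{ and } X_j = E[(X_i - a_i)(X_j - a_j)] = \frac{\text{cofactor of } A_{ij}}{|A|} \tag{7-26}$$

In the particular case, when all pertinent random variables $X_1, X_2, \ldots, X_n$ 
are mutually independent, we have 

$$\text{Covariance of } X_i \text{ and } X_j = \frac{\text{cofactor of } A_{ij}}{|A|} = 0 \qquad i \neq j \tag{7-27}$$

that is, $A$ is a diagonal matrix. The variances become 

$$\sigma_k^2 = \frac{1}{A_{kk}} \qquad k = 1, 2, \ldots, n \tag{7-28}$$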


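The normal multivariate density of $n$ mutually independent random 
variables is given by 

$$f(x_1,x_2,\ldots,x_n) = \frac{\sqrt{|A|}}{(2\pi)^{n/2}} \exp\left[-\frac{1}{2}\sum_{k=1}^{n} A_{kk}(x_k - a_k)^2\right] = \frac{1}{(2\pi)^{n/2}\,\sigma_1\sigma_2\cdots\sigma_n} \exp\left[-\frac{1}{2}\sum_{k=1}^{n} \frac{(x_k - a_k)^2}{\sigma_k^2}\right] \tag{7-29}$$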


That is, in the case of mutually independent normal variables the 
joint density distribution is the product of n one-dimensional normal 
distributions. 
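
The cofactor rule for the variances and covariances says that the covariance matrix is the inverse of $[A]$. The sketch below, in plain Python, illustrates this for a small symmetric matrix and also evaluates the normalizing constant $C_n = \sqrt{|A|}/(2\pi)^{n/2}$. It is an illustration only: the helper names and the numerical matrices are hypothetical, and the determinant is computed by a simple cofactor expansion that is practical only for small $n$.

```python
import math

def minor(M, i, j):
    """Matrix M with row i and column j removed."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(M) if k != i]

def det(M):
    """Determinant by cofactor expansion along the first row."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det(minor(M, 0, j)) for j in range(len(M)))

def cofactor(M, i, j):
    return (-1) ** (i + j) * det(minor(M, i, j))

def covariance_from_A(A):
    """Entry (i, j) is (cofactor of A_ij) / |A|; A is assumed symmetric."""
    d = det(A)
    n = len(A)
    return [[cofactor(A, i, j) / d for j in range(n)] for i in range(n)]

def normalizing_constant(A):
    """C_n = sqrt(|A|) / (2 pi)^(n/2)."""
    return math.sqrt(det(A)) / (2.0 * math.pi) ** (len(A) / 2.0)

# Hypothetical positive definite matrix (leading principal minors 2, 1.75, 2.445).
A = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.3],
     [0.0, 0.3, 1.5]]
cov = covariance_from_A(A)
# A times the covariance matrix should be the identity matrix.
product = [[sum(A[i][k] * cov[k][j] for k in range(3)) for j in range(3)]
           for i in range(3)]

# Diagonal case (independent variables): each variance is 1 / A_kk.
diag_A = [[4.0, 0.0], [0.0, 0.25]]
diag_cov = covariance_from_A(diag_A)
print(product, diag_cov, normalizing_constant(diag_A))
```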

7-3. Linear Combination of Normally Distributed Independent Random 
Variables. The object of this section is to exhibit a most useful 
property of gaussian distributions. It will be shown that any random 
variable consisting of the linear combination of several normally 
distributed independent random variables has itself a normal distribution. 

Let $X_1$ and $X_2$ be independent random variables with normally 
distributed density functions with zero means and standard deviations 
$\sigma_1$ and $\sigma_2$: 

$$f_1(x_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left(-\frac{x_1^2}{2\sigma_1^2}\right) \tag{7-30}$$

$$f_2(x_2) = \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\left(-\frac{x_2^2}{2\sigma_2^2}\right) \tag{7-31}$$

Our problem is to find the density function for the random variable 

$$Y = a_1X_1 + a_2X_2 \tag{7-32}$$

where $a_1$ and $a_2$ are real numbers. According to the rules of 
transformation of variables (Sec. 5-11), the density functions for random variables 
$Y_1 = a_1X_1$ and $Y_2 = a_2X_2$ are, respectively, 

$$f_1(y_1) = \frac{1}{\sqrt{2\pi}\,\alpha_1} \exp\left(-\frac{y_1^2}{2\alpha_1^2}\right) \qquad \alpha_1 = a_1\sigma_1 \tag{7-33}$$

$$f_2(y_2) = \frac{1}{\sqrt{2\pi}\,\alpha_2} \exp\left(-\frac{y_2^2}{2\alpha_2^2}\right) \qquad \alpha_2 = a_2\sigma_2 \tag{7-34}$$


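The density function for the random variable 

$$Y = Y_1 + Y_2 \tag{7-35}$$

can be obtained by convolution, as described in Chap. 0, since $Y_1$ and $Y_2$ 
are independent variables. Thus, 

$$p(y) = \int_{-\infty}^{\infty} f_1(y - y_2)\, f_2(y_2)\, dy_2$$

where $f_1(y_1)$ and $f_2(y_2)$ are the densities of $Y_1$ and $Y_2$, respectively. 

$$\begin{aligned} p(y) &= \int_{-\infty}^{\infty} \frac{1}{2\pi\alpha_1\alpha_2} \exp\left[-\frac{(y - y_2)^2}{2\alpha_1^2}\right] \exp\left(-\frac{y_2^2}{2\alpha_2^2}\right) dy_2 \\ &= \int_{-\infty}^{\infty} \frac{1}{2\pi\alpha_1\alpha_2} \exp\left[-\frac{\alpha_1^2 + \alpha_2^2}{2\alpha_1^2\alpha_2^2}\left(y_2 - \frac{\alpha_2^2\, y}{\alpha_1^2 + \alpha_2^2}\right)^2\right] \exp\left[-\frac{y^2}{2(\alpha_1^2 + \alpha_2^2)}\right] dy_2 \\ &= \frac{1}{\sqrt{2\pi(\alpha_1^2 + \alpha_2^2)}} \exp\left[-\frac{y^2}{2(\alpha_1^2 + \alpha_2^2)}\right] \int_{-\infty}^{\infty} \frac{\sqrt{\alpha_1^2 + \alpha_2^2}}{\sqrt{2\pi}\,\alpha_1\alpha_2} \exp\left[-\frac{\alpha_1^2 + \alpha_2^2}{2\alpha_1^2\alpha_2^2}\left(y_2 - \frac{\alpha_2^2\, y}{\alpha_1^2 + \alpha_2^2}\right)^2\right] dy_2 \end{aligned} \tag{7-36}$$

But 

$$\int_{-\infty}^{\infty} \frac{\sqrt{\alpha_1^2 + \alpha_2^2}}{\sqrt{2\pi}\,\alpha_1\alpha_2} \exp\left[-\frac{\alpha_1^2 + \alpha_2^2}{2\alpha_1^2\alpha_2^2}\left(y_2 - \frac{\alpha_2^2\, y}{\alpha_1^2 + \alpha_2^2}\right)^2\right] dy_2 = 1$$

since the integrand is a normal density in $y_2$. Thus the desired density function $p(y)$ is 

$$p(y) = \frac{1}{\sqrt{2\pi(\alpha_1^2 + \alpha_2^2)}} \exp\left[-\frac{y^2}{2(\alpha_1^2 + \alpha_2^2)}\right] \tag{7-37}$$

That is, $Y$ also has a normal distribution with zero mean and standard 
deviation 

$$\sigma = \sqrt{\alpha_1^2 + \alpha_2^2}$$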

The above procedure can be applied repeatedly in order to obtain the 
density function of the linear combinations of several normally distributed 
independent random variables. Thus the following theorem can be 
established by induction. 

Theorem. Let $X_k$ be normally distributed independent random 
variables with zero means and variance $\sigma_k^2$, that is, 

$$f(x_k) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left(-\frac{x_k^2}{2\sigma_k^2}\right) \qquad k = 1, 2, \ldots, n \tag{7-38}$$

The random variable $Y$ obtained by a linear combination of $X_k$, 

$$Y = a_1X_1 + a_2X_2 + \cdots + a_nX_n \tag{7-39}$$

is a normally distributed random variable with zero mean and standard 
deviation $\sigma$, that is, 

$$\text{Variance of } Y = \sigma^2 = a_1^2\sigma_1^2 + a_2^2\sigma_2^2 + \cdots + a_n^2\sigma_n^2 \tag{7-40}$$

$$p(y) = \frac{1}{\sqrt{2\pi(\alpha_1^2 + \alpha_2^2 + \cdots + \alpha_n^2)}} \exp\left[-\frac{y^2}{2(\alpha_1^2 + \alpha_2^2 + \cdots + \alpha_n^2)}\right] \tag{7-41}$$

where 

$$\alpha_k = a_k\sigma_k \qquad k = 1, 2, \ldots, n \tag{7-42}$$
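
The two-variable case of this theorem can be checked numerically by carrying out the convolution on a grid. The sketch below is an illustration and not part of the original text: the coefficients, standard deviations, integration range, and step count are arbitrary choices, and the absolute value of each coefficient is used for $\alpha_k$ (an assumption made here so that a negative coefficient still gives a positive standard deviation).

```python
import math

def normal_pdf(x, sigma):
    """Zero-mean normal density with standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

def convolved_density(y, alpha1, alpha2, half_width=12.0, steps=4000):
    """Riemann-sum approximation of the convolution integral for p(y)."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        y2 = -half_width + i * h
        total += normal_pdf(y - y2, alpha1) * normal_pdf(y2, alpha2)
    return total * h

a1, a2 = 2.0, -1.0            # coefficients of the linear combination
sigma1, sigma2 = 0.5, 1.5     # standard deviations of X1 and X2
alpha1, alpha2 = abs(a1) * sigma1, abs(a2) * sigma2
sigma_y = math.sqrt(alpha1 ** 2 + alpha2 ** 2)

numeric = convolved_density(0.7, alpha1, alpha2)
closed_form = normal_pdf(0.7, sigma_y)
print(numeric, closed_form)
```

The two printed values should agree closely, which is what the closed-form result of the theorem predicts for $n = 2$.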


7-4. Central-limit Theorem. In the previous sections we have 
discussed how the linear combination of $n$ independent random variables 
with normal distribution leads to a normal distribution in n-dimensional 
space. The engineering implication of this theorem is of great 
significance. In many applications we may be able to assume linearity or 
superposition of the effects of several independent causes. In such 
problems, if each cause has a normal distribution, the interpretation of the 
above theorem is justified. We may be concerned with the addition of 
signals in an adder, or the series or parallel combination of a number of 
components (Fig. 7-1a and b). In each case some kind of law of addition 
may hold, and this law may be exploited to obtain the probability density 
of the over-all sum of the effects. 

In the stated theorem of Sec. 7-3 we have assumed a normal distribution 
for each variable. In this section we remove this restriction and show 
that under reasonably general circumstances the density distribution 
of the sum of $n$ random variables approaches normal distribution as $n$ 
is greatly increased. To be more exact, the following very significant 
theorem, called the central-limit theorem, holds. 

Fig. 7-1. (a) A number of random quantities summed up in an adder; (b) a 
series combination of a number of physical elements. 

Theorem. Let $X_k$ $(k = 1, 2, \ldots, n)$ be mutually independent random 
variables with identical density distribution functions with a given 
finite average $m_1$ and standard deviation $\sigma_1$. Then as $n$ is increased the 
density function of the variable 

$$X = X_1 + X_2 + \cdots + X_n$$

will asymptotically approach a normal distribution with 

$$m = nm_1 \qquad \sigma = \sqrt{n}\,\sigma_1 \tag{7-43}$$

That is, for any real pair of numbers $(a < b)$ 

$$\lim_{n\to\infty} P\left\{a < \frac{X - nm_1}{\sqrt{n}\,\sigma_1} < b\right\} = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-t^2/2}\, dt \tag{7-44}$$


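Proof. The proof of this theorem is somewhat lengthy. In the following 
we only sketch a proof, but for details the reader is referred to Cramér 
(p. 214). 

Consider the standardized random variable 

$$X_0 = \frac{X - m}{\sigma} = \frac{1}{\sqrt{n}} \sum_{k=1}^{n} \frac{X_k - m_1}{\sigma_1} \tag{7-45}$$

According to Sec. 6-10, the characteristic function of $X_0$ can be 
determined by multiplying the characteristic functions of the variables 
$(X_k - m_1)/\sigma_1$ for $k = 1, 2, \ldots, n$, each evaluated at $t/\sqrt{n}$. Thus the results of Example 6-10 
suggest: 

$$\text{Characteristic function of } X_k - m_1 = 1 + 0 - \frac{\sigma_1^2 t^2}{2} + t^2R(t) \tag{7-46}$$

where $R(t) \to 0$ as $t \to 0$. (See Hardy, "Pure Mathematics," p. 289, 
sec. 151.) 

Then the characteristic function of the random variable $X_0$ is 

$$\phi_{X_0}(t) = \prod_{k=1}^{n} \phi_{X_k - m_1}\left(\frac{t}{\sqrt{n}\,\sigma_1}\right) = \left[1 - \frac{t^2}{2n} + \frac{t^2}{n\sigma_1^2}\, R\left(\frac{t}{\sqrt{n}\,\sigma_1}\right)\right]^n$$

As $n$ is infinitely increased, the neglected terms become very small for 
any fixed finite $t$ (see Cramér). Now the limit of $(1 - t^2/2n)^n$ when $n$ 
approaches infinity is $e^{-t^2/2}$; therefore 

$$\lim_{n\to\infty} \phi_{X_0}(t) = \lim_{n\to\infty} \left[1 - \frac{t^2}{2n} + \frac{t^2}{n\sigma_1^2}\, R\left(\frac{t}{\sqrt{n}\,\sigma_1}\right)\right]^n = e^{-t^2/2}$$

The characteristic function of $X_0$ will asymptotically approach the value 
of $e^{-t^2/2}$, the characteristic function of the standard normal distribution. 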

The above fundamental theorem was known to Laplace at the begin- 
ning of the nineteenth century, but its formal proof was given a hundred 
years later by Liapounoff. Today there is a large class of associated 
theorems known as central-limit theorems. It can be shown that it is 
not necessary for the random variables Xk to have the same type of dis- 
tribution. The theorem holds under certain very general conditions 
which are beyond our present scope of interest. 
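
A short simulation can make the statement of the central-limit theorem concrete. The sketch below is only an illustration under stated assumptions: the summands are taken to be exponential with mean 1 and standard deviation 1 (a deliberately non-normal choice made here, not one from the text), and the number of terms, the interval $(a, b)$, the number of trials, and the seed are arbitrary.

```python
import math
import random

def standard_normal_cdf(z):
    """Area under the standard normal curve from -infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def standardized_sum_fraction(n, a, b, trials=20000, seed=1):
    """Fraction of trials in which a < (X - n*m1) / (sqrt(n)*sigma1) < b."""
    rng = random.Random(seed)
    m1, sigma1 = 1.0, 1.0          # mean and standard deviation of each summand
    hits = 0
    for _ in range(trials):
        x = sum(rng.expovariate(1.0) for _ in range(n))
        x0 = (x - n * m1) / (math.sqrt(n) * sigma1)
        if a < x0 < b:
            hits += 1
    return hits / trials

empirical = standardized_sum_fraction(n=50, a=-1.0, b=1.0)
limit_value = standard_normal_cdf(1.0) - standard_normal_cdf(-1.0)
print(empirical, limit_value)
```

The empirical fraction should lie near the normal integral over $(-1, 1)$; the agreement improves as the number of terms grows.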

In conclusion, the reader should have acquired the feeling that in 
engineering problems dealing with a large number of statistically defined 
components one may be able to study the over-all behavior of the system, 
subject to a number of plausible assumptions and constraints. 

Example 7-1. In Fig. E7-1, a number of independent noise voltages $V_i$ $(i = 1, 2, 
\ldots, n)$ are received in the adder; that is, 

$$V = \sum_{i=1}^{n} V_i$$

Each noise voltage has a uniform distribution in the interval $[-A, +A]$: 

$$f(v_i) = \frac{1}{2A} \quad \text{for } |v_i| < A \qquad i = 1, 2, \ldots, n$$

$$f(v_i) = 0 \quad \text{elsewhere} \qquad i = 1, 2, \ldots, n$$

(a) Determine and plot the distribution function of $V$ for $n = 2$. 

(b) Same question as (a) for $n = 3$. 

(c) Same question as (a) for $n$ much larger than 3. 





Solution 

(a) Let the probability density of $V_1$, $V_2$, and $V = V_1 + V_2$ be, respectively, 
$f_1(v_1)$, $f_2(v_2)$, and $f(v)$: 

$$f_1(v_1) = \frac{1}{2A} \qquad -A < v_1 < A$$

$$f_2(v_2) = \frac{1}{2A} \qquad -A < v_2 < A$$

$$f(v) = f_1(v_1) * f_2(v_2)$$

As our readers are generally familiar with unilateral Laplace transforms, we employ 
that technique. 

$$\mathcal{L}f_1(v) = F_1(s) = \frac{1}{2As}\left(e^{As} - e^{-As}\right)$$

$$F(s) = F_1(s) \cdot F_2(s) = \frac{1}{4A^2s^2}\left(e^{2As} - 2 + e^{-2As}\right)$$

Let $u_{-2}(v - k)$ and $u_{-3}(v - k)$ stand for unit ramp and unit parabola applied at 
$v = k$, respectively; then 

$$f(v) = \frac{1}{4A^2}\left[u_{-2}(v + 2A) - 2u_{-2}(v) + u_{-2}(v - 2A)\right]$$

The sum of these three unit ramps has the triangular shape illustrated in Fig. E7-1a. 

(b) The extension of the method described in part (a) yields 

$$f(v) = \mathcal{L}^{-1}\left[\frac{1}{8A^3s^3}\left(e^{As} - e^{-As}\right)^3\right] = \frac{1}{8A^3}\left[u_{-3}(v + 3A) - 3u_{-3}(v + A) + 3u_{-3}(v - A) - u_{-3}(v - 3A)\right]$$

The sum of these four parabolas leads to the probability density curve shown in 
Fig. E7-1b. 

Fig. E7-1 

(c) For very large $n$, the density of $V$ is an asymptotically normal curve with the 
mean $m = nm_1$ and standard deviation $\sigma = \sqrt{n}\,\sigma_1$, where $m_1$ and $\sigma_1$ are the mean and 
the standard deviation of each random variable $V_k$. 

$$m = nm_1 = 0$$

$$\sigma = \sqrt{n}\,\sigma_1 = \sqrt{n}\left(\frac{1}{2A}\int_{-A}^{A} v^2\, dv\right)^{1/2} = \frac{A\sqrt{n}}{\sqrt{3}}$$
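
The ramp and parabola expressions of parts (a) and (b) can be checked numerically: each density should integrate to 1, and for $n = 3$ the variance should equal $3A^2/3 = A^2$, in agreement with part (c). The Python sketch below is an illustration only; the value of $A$, the helper names, and the midpoint integration rule are choices made here.

```python
import math

def ramp(t):
    """Unit ramp u_{-2}(t)."""
    return t if t > 0.0 else 0.0

def parabola(t):
    """Unit parabola u_{-3}(t) = t^2 / 2 for t > 0."""
    return 0.5 * t * t if t > 0.0 else 0.0

def density_two(v, A):
    """Density of the sum of two voltages uniform on [-A, A]."""
    return (ramp(v + 2 * A) - 2 * ramp(v) + ramp(v - 2 * A)) / (4 * A * A)

def density_three(v, A):
    """Density of the sum of three voltages uniform on [-A, A]."""
    return (parabola(v + 3 * A) - 3 * parabola(v + A)
            + 3 * parabola(v - A) - parabola(v - 3 * A)) / (8 * A ** 3)

def integrate(f, lo, hi, steps=20000):
    """Midpoint-rule integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h

A = 1.5
area_two = integrate(lambda v: density_two(v, A), -2 * A, 2 * A)
area_three = integrate(lambda v: density_three(v, A), -3 * A, 3 * A)
variance_three = integrate(lambda v: v * v * density_three(v, A), -3 * A, 3 * A)
sigma_three = A * math.sqrt(3 / 3.0)    # A * sqrt(n / 3) with n = 3
print(area_two, area_three, variance_three, sigma_three ** 2)
```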

Example 7-2. An honest die is rolled 1,000 times. What is the probability that 
the total score is between 3,450 and 3,550? 

Solution. Let $X_k$ be the random variable associated with the kth throw of the die, 
and $S_n$ the total score of $n$ trials; then 

$$E(X_k) = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5$$

$$\operatorname{var}(X_k) = \left(1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2\right)\tfrac{1}{6} - 3.5^2 = \tfrac{35}{12}$$

$$S_n = \sum_{k=1}^{n} X_k$$

The standardized random variable is defined as 

$$X_0 = \frac{S_n - 3.5n}{\sqrt{35n/12}}$$

When $n$ is reasonably large, then, according to the central-limit theorem, the 
distributions of $X_0$ and $S_n/n$ approach that of a normal variate. More specifically, 

$$P\left\{3.450 < \frac{S_n}{n} < 3.550\right\} \approx 0.65$$
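
The normal-table lookup in this example can be reproduced with the error function. The sketch below is a minimal illustration using only the Python standard library; the variable names are chosen here and no continuity correction is applied.

```python
import math

def standard_normal_cdf(z):
    """Area under the standard normal curve from -infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n = 1000
mean_one = sum(range(1, 7)) / 6.0                                  # mean of one throw
var_one = sum(k * k for k in range(1, 7)) / 6.0 - mean_one ** 2    # variance of one throw
sigma_total = math.sqrt(n * var_one)                               # std of the total score

# Standardized limits for a total score between 3,450 and 3,550.
z_low = (3450 - n * mean_one) / sigma_total
z_high = (3550 - n * mean_one) / sigma_total
probability = standard_normal_cdf(z_high) - standard_normal_cdf(z_low)
print(z_low, z_high, probability)
```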

7-5. A Simple Random-walk Problem. As an application of the 
central-limit theorem, let us consider the problem of a random walk on a 
straight line. A moving object starts from the origin of the abscissa on 
this line and makes a sequence of random unit steps in either a positive 
or a negative direction. The probability of the object moving in the 
positive direction has a given value $p$ for each step, independent of the 
previous step. If $X_i$ denotes the ith step, the position of the object 
after $n$ steps is given by the random variable: 

$$X = X_1 + X_2 + \cdots + X_n \tag{7-49}$$

Each random variable $X_i$ assumes the value of $+1$ or $-1$ with respective 
probabilities $p$ and $1 - p$. Therefore, 

$$\overline{X_i} = 1 \cdot p - 1(1 - p) = 2p - 1 \qquad i = 1, 2, \ldots, n$$

$$\overline{X_i^2} = 1 \cdot p + 1(1 - p) = 1 \qquad i = 1, 2, \ldots, n \tag{7-50}$$

$$\operatorname{var} X_i = 1 - (2p - 1)^2 = 4p(1 - p) \qquad i = 1, 2, \ldots, n$$





The distribution of the variable 

$$X_0 = \frac{X - n(2p - 1)}{2\sqrt{np(1 - p)}} \tag{7-51}$$

approaches a standard normal curve. 

The probability that after $n$ steps the object lies between two points 
with abscissa $a$ and $b$ $(b > a)$ can be computed from a normal 
distribution table. For this, one first specifies the range $(c,d)$ of $X_0$ corresponding 
to range $(a,b)$ of $X$; the use of the normal table follows. 

$$P\{a < X < b\} = P\{c < X_0 < d\} = \frac{1}{\sqrt{2\pi}} \int_c^d e^{-t^2/2}\, dt \tag{7-52}$$

For example, when $p = 1/2$ the probability that, after $n$ steps, the object 
lies in the range $[-2\sqrt{n}, 2\sqrt{n}]$ is 

$$\frac{1}{\sqrt{2\pi}} \int_{-2}^{2} e^{-t^2/2}\, dt = 0.9546$$


Example 7-3. Find the probability that a moving object obeying the above 
random-walk law is found after $n$ steps in each of the regions described below: 

(a) $[0, \sqrt{n}/2]$ for $p = 1/2$ 

(c) $n = 400$, $p = 3/4$, $[0, 300]$ 

(d) $n = 400$, $p = 3/4$, $[300, 400]$ 

(e) $n = 400$, $p = 3/4$, $[-\infty, 0]$ 

Solution 

(a) 

$$X_0 = \frac{X - n(2p - 1)}{2\sqrt{np(1 - p)}} = \frac{X}{\sqrt{n}}$$

$$P\left\{0 < X < \frac{\sqrt{n}}{2}\right\} = P\left\{0 < X_0 < \frac{1}{2}\right\} = 0.1915$$

(b) 

(c) $P\{0 < X < 300\} = P\{-11.54 < X_0 < 5.77\} \approx 1$ 

(d) $P\{300 < X < 400\} = P\{5.77 < X_0 < 11.54\} \approx 0$ 

(e) $P\{-\infty < X < 0\} = P\{-\infty < X_0 < -11.54\} \approx 0$ 
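
The change of limits from $X$ to the standardized variable can be written as a small routine. The Python sketch below is an illustration, not part of the original text; it assumes $p = 3/4$ for the cases with $n = 400$ (the value consistent with the standardized limits $-11.54$ and $5.77$ quoted in the solution) and, for the first case, an arbitrary $n = 400$ with $p = 1/2$.

```python
import math

def standard_normal_cdf(z):
    """Area under the standard normal curve from -infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def walk_probability(n, p, a, b):
    """Normal approximation to P{a < X < b} after n steps of the walk."""
    mean = n * (2.0 * p - 1.0)
    sigma = 2.0 * math.sqrt(n * p * (1.0 - p))
    c = (a - mean) / sigma          # standardized lower limit
    d = (b - mean) / sigma          # standardized upper limit
    return standard_normal_cdf(d) - standard_normal_cdf(c)

n = 400
first_case = walk_probability(n, 0.5, 0.0, math.sqrt(n) / 2.0)   # range [0, sqrt(n)/2]
case_0_300 = walk_probability(n, 0.75, 0.0, 300.0)
case_300_400 = walk_probability(n, 0.75, 300.0, 400.0)
print(first_case, case_0_300, case_300_400)
```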


7-6. Approximation of the Binomial Distribution by the Normal 
Distribution. You may have employed the binomial and the normal 
distribution for solving the same problem and have frequently noted that 
the two distributions lead to approximately the same numerical answers. 
In the present section we are in a position to investigate the circumstances 
under which these two distributions may lead to practically identical 
results. 

Consider an experiment which has only two possible outcomes $E$ and 
$E'$. Let 

$$P\{E\} = p \qquad P\{E'\} = 1 - p = q \tag{7-53}$$

If the experiment is repeated $n$ times, the probability of obtaining an 
outcome 

$$E_r = \underbrace{E\,E \cdots E}_{r}\; \underbrace{E'\,E' \cdots E'}_{n-r}$$

irrespective of the order of the sequence is 

$$P\{E_r\} = \binom{n}{r} p^r q^{n-r} \tag{7-54}$$

The probability distribution is of the binomial type. This fact can be 
alternatively expressed in the following way. 

Consider a random variable $X_i$ associated with the ith experiment. 
If the experiment gives rise to $E$, we assign a value of 1 to $X_i$, otherwise 
zero. The probability distribution associated with $X_i$ assumes either of 
the two values $p$ and $q$. The average and the standard deviation of $X_i$ are 
given by 

$$E(X_i) = 1 \cdot P\{X_i = 1\} + 0 \cdot P\{X_i = 0\} = 1 \cdot p + 0 \cdot q = p \tag{7-55}$$

$$E(X_i^2) = 1^2 \cdot p + 0^2 \cdot q = p$$

$$\text{Variance of } X_i = p - p^2 = pq \tag{7-56}$$

Consider next the random variable $X$ associated with the original 
experiment and its probability distribution. The random variable $X$ 
has the binomial distribution given in Eq. (7-54). However, $X$ can now 
be considered as the sum of a number of independent random variables, 
i.e., 

$$X = \sum_{i=1}^{n} X_i$$

According to Sec. 6-8 we have 

$$E(X) = E\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} E(X_i) = np \tag{7-57}$$

$$\sigma_X^2 = \sum_{i=1}^{n} \sigma_{X_i}^2 = npq \tag{7-58}$$

The central-limit theorem asserts that as $n$ is increased the distribution of 
the standardized random variable $X_0$ asymptotically approaches a normal 
distribution. That is, the random variable 

$$X_0 = \frac{X - np}{\sqrt{npq}}$$

is normally distributed with $\overline{X_0} = 0$ and $\sigma_{X_0} = 1$. The binomial 
distribution can be approximated by 

$$P\{a < X < b\}_{\text{binomial}} \approx \frac{1}{\sqrt{2\pi}} \int_{(a - np)/\sqrt{npq}}^{(b - np)/\sqrt{npq}} e^{-t^2/2}\, dt \tag{7-59}$$

The approximation is rather satisfactory even for small values of $n$ if $np$ is, 
say, larger than 10 or so. The result of this section can be expressed in 
the form 

$$\sum_{a < r < b} \binom{n}{r} p^r q^{n-r} \approx \frac{1}{\sqrt{2\pi}} \int_{(a - np)/\sqrt{npq}}^{(b - np)/\sqrt{npq}} e^{-t^2/2}\, dt \tag{7-60}$$


The right side of this equation presents the area under the normal curve 
of the standardized variable between the above limits. The left side can 
also be represented by an area. In fact, consider the elementary 
rectangles whose bases are equal to the difference between two successive 
$r$'s and whose heights are $\sqrt{npq}\,\binom{n}{r} p^r q^{n-r}$ (Fig. 7-2), where the area of 
each elementary rectangle still equals $\binom{n}{r} p^r q^{n-r}$. The approximation 
is rather good when $p$ is not close to either 0 or 1. Note that for $p = 1/2$ 
the graph of the binomial is symmetric about its mean. This is not true 
for $p \neq q$. The normal curve, however, is of course symmetric about 
its mean value. 

Fig. 7-2. A normal approximation to binomial distribution. 
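
The quality of the approximation can be examined by evaluating both sides of the relation for a specific case. The sketch below is an illustration under stated assumptions: $n = 100$ and $p = 1/2$ are arbitrary, and the limits are placed at half-integers (39.5 and 60.5), a choice made here so that the rectangles centered on the integers 40 through 60 are covered exactly.

```python
import math

def standard_normal_cdf(z):
    """Area under the standard normal curve from -infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binomial_interval(n, p, a, b):
    """Exact binomial sum over all integers r with a < r < b."""
    q = 1.0 - p
    return sum(math.comb(n, r) * p ** r * q ** (n - r)
               for r in range(n + 1) if a < r < b)

def normal_interval(n, p, a, b):
    """Normal integral between the standardized limits of a and b."""
    sigma = math.sqrt(n * p * (1.0 - p))
    return (standard_normal_cdf((b - n * p) / sigma)
            - standard_normal_cdf((a - n * p) / sigma))

exact = binomial_interval(100, 0.5, 39.5, 60.5)
approx = normal_interval(100, 0.5, 39.5, 60.5)
print(exact, approx)
```

Repeating the comparison with $p$ close to 0 or 1 gives a visibly larger discrepancy, in line with the remark about the asymmetry of the binomial for $p \neq q$.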

Example 7-4. Compare the binomial distribution 

$$f(r) = \binom{10}{r} p^r q^{10-r}$$

with the normal distribution for each of the following cases: 


V = M 
$p = 1/2$ 


For the binomial distribution assume an approximation of the following type: 

$$f(a)_{\text{binomial}} \approx \int_{a - 1/2}^{a + 1/2} f(x)_{\text{normal}}\, dx$$

Solution. About each point with abscissa $r$ $(r = 1, 2, \ldots, 10)$ construct a 
rectangle with an area of $\binom{10}{r} p^r q^{10-r}$. The base of the rectangle is taken to be equal 
to $1/\sqrt{10pq}$. Equation (7-60) can be directly applied for computing the probabilities. 
The evaluation of the areas of the rectangles and that under the normal curve will 
show that the approximation is rather good. 

Example 7-5. An ordinary coin is tossed 900 times. What is the probability 
that the number of heads will be less than 420? 

Solution. The problem requires the evaluation of 

$$P\{X < 420\}$$

where $X$ is the number of heads. To evaluate this probability, we use a normal 
approximation to the binomial distribution. 

$$n = 900 \qquad p = q = \tfrac{1}{2} \qquad \sigma = \sqrt{npq} = 15$$

Then 

$$X_0 = \frac{X - 450}{15}$$

is asymptotically normally distributed. 

$$P\{X < 420\} = P\{X_0 < -2\} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{-2} e^{-t^2/2}\, dt = \frac{1}{2}\left[1 - 2\phi(2)\right] = 0.0227$$

where $\phi(a)$ denotes the area under the standardized normal curve between 0 and $a$. 


7-7. Approximation of Poisson Distribution by a Normal Distribution. 
As the second application of the central-limit theorem we approximate 
the Poisson distribution by a normal distribution. Let the random 
variable $X$ have a Poisson distribution 

$$P\{X = k\} = \frac{e^{-\lambda}\lambda^k}{k!} \qquad k = 0, 1, 2, \ldots \tag{7-61}$$

In Chap. 6, it was shown that the first and the second moment of $X$ are, 
respectively, 

$$E(X) = \lambda \tag{7-62}$$

$$E(X^2) = \lambda + \lambda^2$$

and 

$$\sigma^2 = \lambda \tag{7-63}$$




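The standardized random variable 

$$X_0 = \frac{X - \lambda}{\sqrt{\lambda}}$$

has the following moment-generating function: 

$$\phi_{X - \lambda}(t) = e^{-\lambda t} \exp(\lambda e^t - \lambda) = \exp(\lambda e^t - \lambda - \lambda t) \tag{7-64}$$

$$\phi_{X_0}(t) = \exp\left(\lambda e^{t/\sqrt{\lambda}} - \lambda - \sqrt{\lambda}\, t\right) = \exp\left(\lambda + \sqrt{\lambda}\, t + \frac{t^2}{2} + \frac{t^3}{3!\sqrt{\lambda}} + \cdots - \lambda - \sqrt{\lambda}\, t\right) = \exp\left(\frac{t^2}{2} + \frac{t^3}{3!\sqrt{\lambda}} + \cdots\right) \tag{7-65}$$

As $\lambda \to \infty$, the moment-generating function of $X_0$ approaches that of a 
standardized normal random variable, that is, 

$$\lim_{\lambda\to\infty} P\left\{a < \frac{X - \lambda}{\sqrt{\lambda}} < b\right\} = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\, dx \tag{7-67}$$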
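
The limit can be illustrated by comparing the exact Poisson probability of the standardized interval with the normal integral for increasing values of the parameter. The following Python sketch is an illustration only; the parameter values 400 and 10,000 and the interval $(-1, 1)$ are arbitrary choices, and the probabilities are evaluated through logarithms to avoid overflow of $\lambda^k$ and $k!$.

```python
import math

def standard_normal_cdf(z):
    """Area under the standard normal curve from -infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def poisson_pmf(k, lam):
    """P{X = k} for a Poisson variable, computed through logarithms."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def poisson_standardized_probability(lam, a, b):
    """Exact P{a < (X - lam) / sqrt(lam) < b}."""
    lo = lam + a * math.sqrt(lam)
    hi = lam + b * math.sqrt(lam)
    k_min = max(0, math.floor(lo) + 1)    # smallest integer greater than lo
    k_max = math.ceil(hi) - 1             # largest integer smaller than hi
    return sum(poisson_pmf(k, lam) for k in range(k_min, k_max + 1))

normal_value = standard_normal_cdf(1.0) - standard_normal_cdf(-1.0)
gap_400 = abs(poisson_standardized_probability(400.0, -1.0, 1.0) - normal_value)
gap_10000 = abs(poisson_standardized_probability(10000.0, -1.0, 1.0) - normal_value)
print(normal_value, gap_400, gap_10000)
```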

7-8. The Laws of Large Numbers 

Weak Law of Large Numbers. The interpretation of this law implies 
that, if a random experiment is repeated a large number of times, the 
average of the results will differ only slightly from the expected value of 
each experiment. For instance, if an honest coin is tossed $n$ times, as 
$n \to \infty$ the average of the results of our experiment, say the frequency of 
the recorded number of heads, will tend to $1/2$, which is the expected value 
of the variable. More exactly: 

Theorem. Let $X_1, X_2, \ldots, X_n$ be independent random variables 
such that 

$$E(X_i) = m \qquad \sigma_{X_i} = \sigma \qquad i = 1, 2, \ldots, n \tag{7-68}$$

Then for any positive $k$, the random variable 

$$S_n = \frac{1}{n} \sum_{i=1}^{n} X_i \tag{7-69}$$

satisfies the inequality 

$$P\{m - k < S_n < m + k\} \to 1 \tag{7-70}$$

That is, $(X_1 + X_2 + \cdots + X_n)/n$ will approach $m$ with probability 1. 

Fig. 7-3. An illustration of the law of large numbers. 

Proof. The proof follows directly from Chebyshev's inequality. In 
fact, the statement of the theorem implies that 

$$E(S_n) = m \qquad \sigma_{S_n} = \frac{\sigma}{\sqrt{n}}$$

Thus, the application of Eq. (6-25) yields 

$$P\{|S_n - m| > k\} \le \frac{\sigma^2}{nk^2} \tag{7-71}$$

But, for a given pair of $\sigma$ and $k$, the ratio of $\sigma^2/nk^2$ tends to zero as $n \to \infty$. 
Therefore 

$$P\{|S_n - m| > k\} \to 0 \tag{7-72}$$


This inequality can also be written in the alternative form 

$$P\{|S_n - m| \le k\} \to 1 \tag{7-73}$$
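
The Chebyshev bound used in this proof can be compared with a simulated frequency. The sketch below is an illustration under stated assumptions: the experiment is the tossing of an honest coin ($m = 1/2$, $\sigma = 1/2$), and the values of $n$, $k$, the number of trials, and the seed are arbitrary choices made here.

```python
import random

def deviation_frequency(n, k, trials=2000, seed=7):
    """Fraction of trials in which |S_n - 1/2| > k for n tosses of an honest coin."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        heads = sum(1 for _ in range(n) if rng.random() < 0.5)
        if abs(heads / n - 0.5) > k:
            count += 1
    return count / trials

def chebyshev_bound(sigma, n, k):
    """Upper bound sigma^2 / (n k^2) on P{|S_n - m| > k}."""
    return sigma ** 2 / (n * k ** 2)

frequency = deviation_frequency(n=1000, k=0.05)
bound = chebyshev_bound(sigma=0.5, n=1000, k=0.05)
print(frequency, bound)
```

The simulated frequency is typically far below the bound, which is a reminder that Chebyshev's inequality is valid for any distribution with the given mean and variance and is therefore rather conservative.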

Strong Law of Large Numbers. The following theorem is given without 
a proof. 

Theorem. Let $X_1, X_2, \ldots, X_n, \ldots$ be an infinite sequence of 
independent random variables such that 

$$E(X_i) = m \qquad \sigma_{X_i} = \sigma \qquad i = 1, 2, \ldots$$

Let $\omega$ be the set of points $x$ of the sample space for which 

$$\lim_{n\to\infty} \frac{X_1(x) + X_2(x) + \cdots + X_n(x)}{n} = m$$

Then $\omega$ has probability 1 (is said to be almost certain). The strong law 
implies that the limit of the average approaches the common expected 
value of the afore-mentioned independent variables. For proof see 
Loève and Fortet. It should be kept in mind that the laws of large 
numbers and the central-limit theorem can be stated under more general 
circumstances than those assumed in this section. 





PROBLEMS 

7-1. An honest coin is tossed 1,000 times. 

(a) Find the probability of a head occurring less than 500 times. 

(b) Find the probability of a head occurring less than 500 but more than 450 times. 
Use the normal approximation. 

(c) Same question as (b) but use Chebyshev's inequality. 

7-2. $X$ is a random variable with binomial distribution $(n = 300, p = 1/3)$. 
Approximate $P\{X < 100\}$. 

7-3. For a random voltage $V$ assume the following values with their respective 
probabilities: 

[-2, -1, 0, 1, 2] 

[0.05, 0.25, 0.15, 0.45, 0.10] 

(a) Find $\overline{V}$. 

(b) Find the standard deviation of $V$. 

(c) If the voltage $V$ is applied across a 2-ohm resistor, determine the average power 
$W$ dissipated in the resistor in unit time. 

(d) What is the probability of the power $W$ being less than 0.10, 0.20, 0.50 watt, 
respectively? 

(e) The voltage $V$ is applied to the same resistor but only for a period of microsecond. 
Now suppose that this experiment is repeated 10“ times. Find the 
probability that the total dissipated power will remain between $w_1$ and $w_2$ watts. 

(f) Calculate part (e) for 

$w_1 = 0.01 \qquad w_2 = 0.02$ 
$w_1 = 0.05 \qquad w_2 = 0.10$ 
$w_1 = 0.10 \qquad w_2 = 0.15$ 

7-4. The independent noise voltages $V_1$ and $V_2$ are added to a d-c signal of 10 volts 
after going through amplifiers of parameters $k_1$ and $k_2$, respectively. Find the density 
of the output 

$$V = k_1V_1 + k_2V_2 + 10$$

(a) V\ and are normal with parameters (if, 1) and (— 2,.‘i). 

$k_1 = 2 \qquad k_2 = 3$ 


(b) $V_1$ and $V_2$ are uniformly distributed between 0 and 1 volt. 

A-, =2 lf2 = 

7-5. Let $U$ and $V$ be the output of two adders. The input to each adder is obtained 
through a number of linear networks not necessarily independent, that is, 

$$U = \sum_{i=1}^{n} a_iX_i \qquad V = \sum_{j=1}^{n} b_jX_j$$

(a) Determine the standard deviation of $U$ and $V$. 

(b) Show that the covariance of the output of the adders is 

$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i b_j \mu_{ij}$$

where $\mu_{ij}$ is the covariance of $X_i$ and $X_j$. 





7-6. Let Sn be the average of n identically distributed random variables with 
binomial distribution {p,q). Prove that 




7-7. Let $X$ be a random signal with $(m, \sigma)$. Using Chebyshev's inequality, show 

(a) $P\{|X - m| < a\} > 95$ per cent for $a > 4.5\sigma$ 

(b) $P\{|X - m| < a\} > 99$ per cent for $a > 10\sigma$ 

(c) $P\left\{\text{fluctuation } \dfrac{|X - m|}{|m|} < a\right\} > 95$ per cent for $a > 4.5\,\dfrac{\sigma}{|m|}$ 

(d) $P\left\{\dfrac{|X - m|}{|m|} < a\right\} > 99$ per cent for $a > 10\,\dfrac{\sigma}{|m|}$ 

(e) When the signal is approximately normally distributed, the numbers 4.5 and 10 
in the above inequalities could be respectively reduced to 1.96 and 2.58 [see Parzen 
(Chap. 8)]. 

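7-8. Show that the characteristic function of an n-dimensional normal 
distribution is 

$$\phi(t_1, \ldots, t_n) = \exp\left(-\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mu_{ij}\, t_i t_j\right)$$

where $\mu_{ij}$ is the second joint moment of $X_i$ and $X_j$, and 

$$\overline{X_k} = 0 \qquad \text{for } k = 1, 2, \ldots, n$$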

7-9. Consider a normally distributed variable with the probability density 

$$f(x_1, x_2, \ldots, x_n) = C \exp\left(-\tfrac{1}{2}Q\right)$$

where, as usual, 

$$Q = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij}(x_i - a_i)(x_j - a_j)$$

Find the probability density of $Q$ by using the concept of moment-generating 
function and transformation of variables. 



CHAPTER 8 


CONTINUOUS CHANNEL WITHOUT MEMORY 


8-1. Definition of Different Entropies. In a preceding chapter we 
studied the transmission of information by discrete symbols. In many 
practical applications the information is transmitted by continuous 
signals, such as continuous electric waves. That is, the transmitted signal 
(Fig. 8-1) is a continuous function of time during a finite time interval. 
During that interval the amplitude of the signal assumes a continuum of 
values with a specified probability density function. 

Fig. 8-1. A continuous signal. 

The main object of this chapter is to outline some results for continuous 
channels similar to those discussed for discrete systems, principally the 
entropy associated with a random variable assuming a continuum of 
values. The extension of mathematical results obtained for finite, 
discrete systems to infinite systems or systems with continuous parameters is 
quite frequent in problems of mathematics and physics. Such extensions 
require a certain amount of care if mathematical difficulties and 
inaccuracies are to be avoided. For example, matrix algebra is quite familiar to 
most scientists in so far as finite matrices are concerned. The same 
concept can be extended to cover infinite matrices and Hilbert spaces, the 
extension being subject to special mathematical disciplines that require 
time and preparation to master. 

Similarly, in network theory, one is familiar with the properties of the 
rational driving-point impedance functions associated with lumped 
linear networks; but when dealing with impedances associated with 
transmission lines, one is far less knowledgeable as to the class of pertinent 
transcendental functions describing the impedances. (In fact, as yet there 
is a very limited body of work available on the extension of existing 
methods of network synthesis from lumped-parameter to 
distributed-parameter systems.) 

This pattern of increased complexity of analysis, requiring special 
mathematical consideration for passage from the finite to the infinite and 
from the discrete to the continuous, also prevails in the field of 
information theory. 




One method of presentation is to extend the definitions for entropies 
from discrete to continuous cases in a way similar to the presentation of 
the probability of discrete and continuous random variables. In fact, 
we have already used such a technique for defining the expectation of 
a continuous random variable. This analogous presentation has the 
apparent merit of simplicity and convenience, but at the expense of not 
always being well defined from a strictly mathematical point of view. 
Also, the engineering significance of the entropy of a continuous random 
variable becomes somewhat obscure, as is shown later. The 
mathematically inclined reader may find it more tenable to start this discussion 
with the definition of the mutual information between two random objects 
each assuming a continuum of values. The procedure has been outlined 
in Shannon's original paper as well as in a fundamental paper of A. N. 
Kolmogorov (see also the reference cited on page 289). 

The definitions of the different entropies in the discrete case were 
based on the concept of different expectations encountered in the case of 
two-dimensional discrete distributions. In a similar way, we may 
introduce different entropies in the case of one-dimensional or 
multidimensional random variables with continuous distributions. For a 
one-dimensional random variable, 

$$H(X) = E[-\log f(X)] = -\int_{-\infty}^{\infty} f(x) \log f(x)\, dx \tag{8-1}$$

The different entropies associated with a two-dimensional random 
variable possessing a joint density $f(x,y)$ and marginal densities $f_1(x)$ and $f_2(y)$ 
are 

$$H(X,Y) = E[-\log f(X,Y)] = -\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y) \log f(x,y)\, dx\, dy \tag{8-2}$$

$$H(X) = E[-\log f_1(X)] = -\int_{-\infty}^{\infty} f_1(x) \log f_1(x)\, dx \tag{8-3}$$

$$H(Y) = E[-\log f_2(Y)] = -\int_{-\infty}^{\infty} f_2(y) \log f_2(y)\, dy \tag{8-4}$$

$$H(X|Y) = E[-\log f(X|Y)] = -\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y) \log \frac{f(x,y)}{f_2(y)}\, dx\, dy \tag{8-5}$$

$$H(Y|X) = E[-\log f(Y|X)] = -\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y) \log \frac{f(x,y)}{f_1(x)}\, dx\, dy \tag{8-6}$$

Finally, as will be pointed out in Chap. 9, for an n-dimensional random 
variable possessing a probability density function $f(x_1, x_2, \ldots, x_n)$, 
the entropy is defined as 

$$H(X_1, \ldots, X_n) = E[-\log f(X_1, \ldots, X_n)] = -\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_n) \log f(x_1, \ldots, x_n)\, dx_1 \cdots dx_n \tag{8-7}$$

All definitions here are contingent upon the existence of the 
corresponding integrals. 
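
The definition of the entropy of a one-dimensional continuous variable can be evaluated numerically for any density whose integral exists. The sketch below is an illustration only: it uses a normal density with $\sigma = 2$ (an arbitrary choice), natural logarithms, and a midpoint rule over a finite range, and it compares the result with the well-known closed form $\tfrac{1}{2}\ln(2\pi e \sigma^2)$ for the normal density, which is quoted here as a standard fact rather than taken from this section.

```python
import math

def differential_entropy(pdf, lo, hi, steps=200000):
    """Midpoint-rule value of -integral of f(x) ln f(x) dx over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        f = pdf(lo + (i + 0.5) * h)
        if f > 0.0:                 # the term f ln f is taken as 0 where f = 0
            total -= f * math.log(f) * h
    return total

sigma = 2.0

def normal_pdf(x):
    return math.exp(-0.5 * (x / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

numeric = differential_entropy(normal_pdf, -40.0, 40.0)
closed_form = 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)
print(numeric, closed_form)
```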

8-2. The Nature of Mathematical Difficulties Involved. In this sec- 
tion we describe the mathematical difficulties encountered in extending 
the concept of self -information from discrete to continuous models. 
There are at least three basic points to be discussed : 

1. The entropy of a random variable with continuous distribution 
may be negative. 

2. The entropy of a random variable with continuous distribution may 
become infinitely large. Furthermore, if the probability scheme under 
consideration is “approximated” by a discrete scheme, it can be shown 
that the entropy of the discrete scheme will always tend to infinity as the 
quantization is made finer and finer. 

3. In contrast to the discrete case, the entropy of a continuous system does not necessarily remain invariant under a transformation of the coordinate system.

Of these three difficulties, perhaps the first one is the most apparent. 
The second and third require more explanation. For this reason we treat 
item 1 in this section but defer discussion of topics 2 and 3 to a later 
section. 

Negative Entropies. In the discrete case all the entropies involved are nonnegative quantities because the probability of the occurrence of an event in the discrete case is a positive number less than or equal to 1. In the continuous case,

$$\int_{-\infty}^{\infty} f(x)\,dx = 1 \qquad \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\,dx\,dy = 1 \tag{8-8}$$


Evidently, the density functions need not be less than 1 for all values of the random variable; this fact may lead to a negative entropy. A situation leading to a negative entropy is illustrated in Example 8-1, where the entropy associated with the density function depends on the value of a parameter. This is a reason why the concept of self-information no longer can be associated with H(X) as in the discrete case. We call H(X) the entropy function, but H(X) no longer indicates the average self-information of the source.

Similar remarks are valid for conditional entropies. Thus it follows that the individual entropies may assume negative values. However, it will be shown that the mutual information is not subject to this objection.

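Example 8-1. A random variable has the density function shown in Fig. E8-1. Find the corresponding entropy.

Fig. E8-1

Solution

$$f(x) = \frac{2h}{b-a}(x-a) \qquad \text{for } a \le x \le \frac{a+b}{2}$$

$$f(x) = \frac{2h}{b-a}(b-x) \qquad \text{for } \frac{a+b}{2} \le x \le b$$

$$H(X) = -\int_a^{(a+b)/2}\frac{2h}{b-a}(x-a)\ln\left[\frac{2h}{b-a}(x-a)\right]dx - \int_{(a+b)/2}^{b}\frac{2h}{b-a}(b-x)\ln\left[\frac{2h}{b-a}(b-x)\right]dx$$

The above integrals can be evaluated by parts. In so doing, note that

$$\int x\ln\lambda x\,dx = \frac{x^2}{2}\ln\lambda x - \frac{x^2}{4}$$

Thus,

$$H(X) = -\frac{2h}{b-a}\left[\frac{(x-a)^2}{2}\ln\frac{2h(x-a)}{b-a} - \frac{(x-a)^2}{4}\right]_a^{(a+b)/2} + \frac{2h}{b-a}\left[\frac{(b-x)^2}{2}\ln\frac{2h(b-x)}{b-a} - \frac{(b-x)^2}{4}\right]_{(a+b)/2}^{b}$$

$$= -\frac{h(b-a)}{2}\ln h + \frac{h(b-a)}{4}$$

and since

$$\frac{h(b-a)}{2} = \int_a^b f(x)\,dx = 1$$

thus

$$H(X) = \frac{1}{2} - \ln h = \ln\frac{\sqrt{e}}{h}$$

The entropy depends on the parameter h, but a translation of the probability curve along the x axis does not change its value. Note also that

$$H(X) > 0 \quad \text{for } h < \sqrt{e}$$

$$H(X) = 0 \quad \text{for } h = \sqrt{e}$$

$$H(X) < 0 \quad \text{for } h > \sqrt{e}$$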
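The closed form of Example 8-1 can be checked numerically. The following sketch is an illustration and not part of the original solution: the function names, the midpoint-rule integrator, and the step count are assumptions made for this example. It integrates $-f\ln f$ for the triangular density and compares the result with $\tfrac{1}{2} - \ln h$.

```python
import math

def triangular_density(x, a, b):
    """Triangular density of Fig. E8-1: base [a, b], peak h = 2/(b - a) at the midpoint."""
    if x < a or x > b:
        return 0.0
    h = 2.0 / (b - a)            # fixed by the unit-area condition h(b - a)/2 = 1
    slope = 2.0 * h / (b - a)
    mid = 0.5 * (a + b)
    return slope * (x - a) if x <= mid else slope * (b - x)

def entropy_nats(density, lo, hi, steps=100_000):
    """Midpoint-rule estimate of -integral of f ln f over [lo, hi], in natural units."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        f = density(lo + (k + 0.5) * dx)
        if f > 0.0:              # f ln f tends to 0 as f tends to 0
            total -= f * math.log(f) * dx
    return total

def closed_form_entropy(a, b):
    """H(X) = 1/2 - ln h with h = 2/(b - a), as derived in Example 8-1."""
    return 0.5 - math.log(2.0 / (b - a))

if __name__ == "__main__":
    critical_width = 2.0 / math.sqrt(math.e)   # width at which h = sqrt(e)
    for a, b in [(0.0, 4.0), (1.0, 1.0 + critical_width), (0.0, 0.5)]:
        numeric = entropy_nats(lambda x: triangular_density(x, a, b), a, b)
        print(f"b - a = {b - a:.4f}  numeric = {numeric:+.6f}  "
              f"closed form = {closed_form_entropy(a, b):+.6f}")
```

The three intervals are chosen so that h is below, equal to, and above $\sqrt{e}$, which should give a positive, a vanishing, and a negative entropy respectively.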

8-3. Infiniteness of Continuous Entropy. Let X be a one-dimensional random variable with a well-defined range [a,b] and a probability density function $f(x)$, that is,

$$P\{c < X < d\} = \int_c^d f(x)\,dx = F(d) - F(c) \tag{8-9}$$

We propose to examine the entropy associated with this random variable, following a familiar mathematical routine. That is, we divide the interval of interest between a and b into nonoverlapping subintervals (see Fig. 8-2):

Fig. 8-2. A quantization of a continuous signal for computing entropy.




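$$(a, a_1],\ (a_1, a_2],\ \ldots,\ (a_n, b] \tag{8-10}$$

$$a < a_1 < a_2 < \cdots < a_n < a_{n+1} = b \tag{8-11}$$

$$a_1 - a = \Delta a_1,\ \ldots,\ a_k - a_{k-1} = \Delta a_k,\ \ldots,\ b - a_n = \Delta a_{n+1} \qquad k = 1, 2, \ldots, n+1 \tag{8-12}$$

$$P\{a < X \le a_1\} = \int_a^{a_1} f(x)\,dx = F(a_1) - F(a) = p_1\,\Delta a_1 \tag{8-13}$$

$$P\{a_1 < X \le a_2\} = \int_{a_1}^{a_2} f(x)\,dx = F(a_2) - F(a_1) = p_2\,\Delta a_2 \tag{8-14}$$

$$P\{a_n < X \le b\} = \int_{a_n}^{b} f(x)\,dx = F(b) - F(a_n) = p_{n+1}\,\Delta a_{n+1} \tag{8-15}$$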


Now we may define another random variable $X_d$, assuming only the discrete set of values

$$[a_1, a_2, \ldots, a_n, b] \tag{8-16}$$

with respective probabilities

$$[p_1\,\Delta a_1,\ p_2\,\Delta a_2,\ \ldots,\ p_n\,\Delta a_n,\ p_{n+1}\,\Delta a_{n+1}] \tag{8-17}$$


According to Eq. (8-9) the events under consideration form a finite, complete probability scheme, as

$$\sum_{k=1}^{n+1} p_k\,\Delta a_k = F(b) - F(a) = 1 \tag{8-18}$$

Thus

$$H(X_d) = -\sum_{k=1}^{n+1} p_k\,\Delta a_k \log p_k\,\Delta a_k \tag{8-19}$$

Now, let the length of each interval in Eq. (8-19) become infinitely small by infinitely increasing n. It is reasonable to anticipate that in the limit, when every interval becomes vanishingly small, the entropy of this discrete scheme should approach that of the continuous model. The process can be made more evident by adopting an arbitrary level of quantization, say

$$\Delta a_1 = \Delta a_k = \Delta a_{n+1} = \Delta x \qquad k = 1, 2, \ldots, n+1 \tag{8-20}$$

and evaluating the above entropy,

$$H(X_d) = -\sum_{k=1}^{n+1} p_k\,\Delta x \log p_k - \sum_{k=1}^{n+1} p_k\,\Delta x \log \Delta x \tag{8-21}$$

But

$$H(X) = \lim_{\Delta x \to 0} H(X_d) \tag{8-22}$$

When $\Delta x$ is made smaller and smaller, while $p_k\,\Delta a_k$, the area under the curve between $a_{k-1}$ and $a_k$, tends to zero, the ratio of the area to $\Delta a_k$ remains finite for a continuous distribution. In the limit

$$\lim_{\Delta x \to 0} H(X_d) = -\int_a^b f(x)\log f(x)\,dx - \lim_{\Delta x \to 0}\sum_{k=1}^{n+1} p_k\,\Delta x \log \Delta x \tag{8-23}$$

assuming that the first integral exists.

As the subintervals are made smaller by making n larger, the $p_k\,\Delta x$ become smaller but the entropy $H(X_d)$ increases. Thus, in the limit when an infinite number of infinitesimal subintervals are considered, the entropy becomes infinitely large. The interpretation is that the continuous distribution can potentially convey infinitely large amounts of information. We have used the word “potentially” since the information must be received by a receiver or an observer. The observer can receive information with a bounded accuracy. Thus H(X) should preferably be written as Ht(X), indicating the bounded level of accuracy of the observer. If the observer had an infinitely great level of accuracy, he could detect an infinitely large amount of information from a random signal assuming a continuum of values.

In a manner similar to the definition of entropy in the discrete case, we may define the entropy of a complete continuous scheme defined in [a,b] as

$$H(X) = -\int_a^b f(x)\log f(x)\,dx$$

It is important to note that the integral of Eq. (8-1) defining the 
entropy of a continuous random variable is not necessarily infinite. 
However, the above limiting process, which introduces the concept of a 
discrete analog model with an infinitely large number of states, always 
leads to an infinite entropy. 
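The divergence described in this section can be observed numerically. The sketch below is only an illustration under stated assumptions: the density $f(x) = 2x$ on [0,1] (with $F(x) = x^2$) and all function names are chosen here for the example and do not come from the text. It quantizes [0,1] into n equal cells, evaluates the entropy of the resulting discrete scheme, and shows that this entropy grows like $-\ln \Delta x$ while the sum of the entropy and $\ln \Delta x$ settles on the integral of Eq. (8-1).

```python
import math

def cdf_linear_density(x):
    """Distribution function F(x) = x^2 of the assumed density f(x) = 2x on [0, 1]."""
    return x * x

def quantized_entropy(cdf, a, b, n):
    """Entropy (nats) of the discrete scheme obtained from n equal cells of [a, b].

    Returns the pair (H(X_d), cell width)."""
    dx = (b - a) / n
    h = 0.0
    for k in range(n):
        p = cdf(a + (k + 1) * dx) - cdf(a + k * dx)   # cell probability p_k * dx
        if p > 0.0:
            h -= p * math.log(p)
    return h, dx

# Value of -integral of 2x ln(2x) over [0, 1], evaluated by parts.
INTEGRAL_ENTROPY = 0.5 - math.log(2.0)

if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        h_d, dx = quantized_entropy(cdf_linear_density, 0.0, 1.0, n)
        print(f"n = {n:6d}  H(X_d) = {h_d:8.4f}  "
              f"H(X_d) + ln(dx) = {h_d + math.log(dx):+.6f}")
    print(f"integral of Eq. (8-1): {INTEGRAL_ENTROPY:+.6f}")
```

Each tenfold refinement of the cells should add roughly $\ln 10$ to the discrete entropy, which is the unbounded term of Eq. (8-23).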

8-4. Variability of the Entropy in the Continuous Case with Coordinate Systems. Consider a one-dimensional continuous random variable X with a density function $f(x)$. Let the variable X be transformed into a new variable Y by a continuous one-to-one transformation. The density function $p(y)$ is

$$p(y) = f(x)\left|\frac{dx}{dy}\right| \tag{8-24}$$

If it is assumed that the transformation $Y = g(X)$ is monotone and single-valued, the entropy associated with Y is

$$H(Y) = -\int_{-\infty}^{\infty} p(y)\log p(y)\,dy \tag{8-25}$$

$$H(Y) = -\int_{-\infty}^{\infty} f(x)\left|\frac{dx}{dy}\right|\log\left[f(x)\left|\frac{dx}{dy}\right|\right]dy \tag{8-26}$$

$$= -\int_{-\infty}^{\infty} f(x)\log f(x)\,dx - \int_{-\infty}^{\infty} f(x)\log\left|\frac{dx}{dy}\right|dx \tag{8-27}$$

$$H(Y) = H(X) + \int_{-\infty}^{\infty} f(x)\log\left|\frac{dy}{dx}\right|dx \tag{8-28}$$

The entropy of the new system depends on the associated function $\log|dx/dy|$. This is in contrast with the discrete case. In the discrete case, the values associated with the random variable do not enter into the computation. For instance, the entropy associated with the throwing of an ordinary die is

$$[X] = [1, 2, 3, 4, 5, 6]$$

$$[P] = [\tfrac{1}{6}, \tfrac{1}{6}, \tfrac{1}{6}, \tfrac{1}{6}, \tfrac{1}{6}, \tfrac{1}{6}]$$

$$H(X) = 6\left(-\tfrac{1}{6}\log\tfrac{1}{6}\right) = \log 6$$

A change of variable, say $Y = X^2$, does not produce any change in the entropy since [P] remains unchanged:

$$[X^2] = [1, 4, 9, 16, 25, 36]$$

$$[P] = [\tfrac{1}{6}, \tfrac{1}{6}, \tfrac{1}{6}, \tfrac{1}{6}, \tfrac{1}{6}, \tfrac{1}{6}]$$

$$H(X^2) = \log 6 = H(X)$$
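The die illustration is easy to reproduce. The sketch below is a minimal example with hypothetical helper names (`entropy_bits`, `transformed_scheme`); it relabels the faces by squaring and confirms that the discrete entropy is untouched. A many-to-one relabeling (parity) is included only for contrast, as an assumption beyond the text, to show that the invariance relies on the change of variable being one-to-one.

```python
import math
from collections import defaultdict

def entropy_bits(probabilities):
    """Entropy of a finite complete scheme, in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0.0)

def transformed_scheme(values, probabilities, transform):
    """Push a discrete scheme through `transform`, merging outcomes that coincide."""
    merged = defaultdict(float)
    for v, p in zip(values, probabilities):
        merged[transform(v)] += p
    return dict(merged)

faces = [1, 2, 3, 4, 5, 6]
p = [1.0 / 6.0] * 6

squares = transformed_scheme(faces, p, lambda x: x * x)   # one-to-one on the faces
parity = transformed_scheme(faces, p, lambda x: x % 2)    # many-to-one, for contrast

if __name__ == "__main__":
    print("H(X)       =", entropy_bits(p))
    print("H(X^2)     =", entropy_bits(squares.values()))
    print("H(X mod 2) =", entropy_bits(parity.values()))
```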



When the transformation of the continuous random variable is linear, that is,

$$Y = AX + B \tag{8-29}$$

Eq. (8-28) yields

$$H(Y) = H(X) + \int_{-\infty}^{\infty} f(x)\log|A|\,dx \tag{8-30}$$

$$H(Y) = H(X) + \log|A| \quad \text{in bits} \tag{8-31}$$

Equation (8-31) suggests that the entropy of a continuous random variable subjected to a linear transformation of the axis remains invariant within a constant $\log|A|$.

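Example 8-2. Find the entropy of a continuous random variable with the density function as illustrated in Fig. E8-2.

$$f(x) = bx^2 \qquad 0 \le x \le a$$

$$f(x) = 0 \qquad \text{elsewhere}$$

Determine the entropy $H(X_1)$ when $x_1 = x + d$, $d > 0$. Answer the same question for the transformation $x_2 = 2x$.

Fig. E8-2

Solution. The value of b which makes the above $f(x)$ a permissible probability density function is given by

$$\int_0^a bx^2\,dx = \frac{ba^3}{3} = 1 \qquad b = \frac{3}{a^3}$$

The entropy is

$$H(X) = -\int_0^a bx^2\ln bx^2\,dx$$

Note that

$$\int x^2\ln\lambda x\,dx = \frac{x^3}{3}\ln\lambda x - \frac{x^3}{9}$$

Thus

$$H(X) = -2b\left[\frac{x^3}{3}\ln\sqrt{b}\,x - \frac{x^3}{9}\right]_0^a = -2b\,\frac{a^3}{3}\left(\ln\sqrt{b}\,a - \frac{1}{3}\right)$$

$$H(X) = -2\left(\ln\sqrt{\frac{3}{a}} - \frac{1}{3}\right) = \frac{2}{3} + \ln\frac{a}{3}$$

The entropy may be positive, negative, or zero, depending on the parameter a.

$$H(X) > 0 \qquad a > 3e^{-2/3}$$

$$H(X) = 0 \qquad a = 3e^{-2/3}$$

$$H(X) < 0 \qquad a < 3e^{-2/3}$$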



Now consider the simple translation of the vertical axis,

$$X_1 = x + d \qquad d > 0$$

The probability density curve is simply shifted d units relative to the new axis. The entropy becomes

$$H(X_1) = -2b\left[\frac{(x_1 - d)^3}{3}\ln\sqrt{b}\,(x_1 - d) - \frac{(x_1 - d)^3}{9}\right]_{d}^{a+d}$$

$$H(X_1) = \frac{2}{3} + \ln\frac{a}{3} = H(X)$$

This result could have been predicted from Eq. (8-31), as

$$\log|A| = \log 1 = 0$$

For the transformation $X_2 = 2x$, one finds

$$p(x_2) = \frac{bx^2}{2} = \frac{b}{8}x_2^2$$

$$H(X_2) = -\int_0^{2a}\frac{b}{8}x_2^2\ln\left(\frac{b}{8}x_2^2\right)dx_2$$

$$H(X_2) = \ln\frac{2a}{3} + \frac{2}{3}$$

This result is in agreement with Eq. (8-31), that is,

$$H(X_2) = H(X) + \int_0^a bx^2\ln 2\,dx = H(X) + \ln 2$$
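The behavior found in Example 8-2 can be confirmed by direct numerical integration. This is a sketch under assumptions: the helper names, the values a = 2 and d = 5, and the midpoint-rule integrator are choices made for illustration only. It builds the density of $Y = AX + B$ from Eq. (8-24) and checks that a translation leaves the entropy unchanged while a scaling by 2 adds $\ln 2$, as Eq. (8-31) predicts.

```python
import math

def entropy_nats(density, lo, hi, steps=100_000):
    """Midpoint-rule estimate of -integral of f ln f over [lo, hi], in natural units."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        f = density(lo + (k + 0.5) * dx)
        if f > 0.0:
            total -= f * math.log(f) * dx
    return total

def power_density(a):
    """f(x) = b x^2 on [0, a] with b = 3 / a^3, as in Example 8-2."""
    b = 3.0 / a ** 3
    return lambda x: b * x * x if 0.0 <= x <= a else 0.0

def linearly_transformed(density, A, B):
    """Density of Y = A X + B from p(y) = f(x) |dx/dy|, Eq. (8-24)."""
    return lambda y: density((y - B) / A) / abs(A)

if __name__ == "__main__":
    a, d = 2.0, 5.0                      # illustrative values only
    f = power_density(a)
    h_x = entropy_nats(f, 0.0, a)
    h_shift = entropy_nats(linearly_transformed(f, 1.0, d), d, a + d)
    h_scale = entropy_nats(linearly_transformed(f, 2.0, 0.0), 0.0, 2.0 * a)
    print("H(X)   numeric:", h_x, " closed form:", 2.0 / 3.0 + math.log(a / 3.0))
    print("H(X_1) - H(X) :", h_shift - h_x)
    print("H(X_2) - H(X) :", h_scale - h_x, " ln 2 =", math.log(2.0))
```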

8-5. A Measure of Information in the Continuous Case. The material of the preceding two sections might lead one to think that the concept of entropy loses its usefulness for continuous systems. On the contrary, the concept of entropy is as important in the continuous case as in the discrete case. To put this concept into focus, one has to reorient one's thoughts slightly by putting the emphasis on the transinformation rather than on the individual entropies. The different entropies associated with a continuous channel, that is, H(X), H(Y), H(X|Y), H(Y|X), and H(X,Y), have no direct interpretation as far as the information processed in the channel is concerned. However, it will be shown that the transinformation I(X;Y) retains its information-theory significance. Owing to this, we use the concept of the transinformation I(X;Y) of the random variables X and Y as the starting point in defining the entropy of continuous systems.

$$I(X;Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\log\frac{f(x,y)}{f_1(x)f_2(y)}\,dx\,dy \tag{8-32}$$



Note that, as before, we have

$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y) \tag{8-33}$$

Now we should be able to demonstrate how this measure of transinformation does not face the three mentioned difficulties encountered in dealing with individual entropies.

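Transinformation Is Nonnegative. A proof of the validity of this property can be obtained by using the basic inequality for convexity of a logarithmic function, namely $\log u \le (u - 1)\log e$ for $u > 0$.

$$I(X;Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\log\frac{f(x,y)}{f_1(x)f_2(y)}\,dx\,dy$$

$$\ge -\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\left[\frac{f_1(x)f_2(y)}{f(x,y)} - 1\right]\log e\,dx\,dy$$

$$= -\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_2(y)f_1(x)\log e\,dx\,dy + \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\log e\,dx\,dy$$

$$= -1\cdot 1\cdot\log e + \log e = 0$$

Hence

$$I(X;Y) \ge 0 \tag{8-34}$$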


Transinformation Is Generally Finite. Consider the expression

$$\log\frac{P\{x_i < X < x_i + \Delta x,\ y_j < Y < y_j + \Delta y\}}{P\{x_i < X < x_i + \Delta x\}\,P\{y_j < Y < y_j + \Delta y\}} \tag{8-35}$$

which is a direct extension of the definition of mutual information in the discrete case. As $\Delta x$ and $\Delta y$ are made smaller, each of these probability terms tends to zero. This was the reason for the individual continuous entropies to tend to infinity in our passage from the discrete to the continuous models. While each of the individual terms tends to zero, the above ratio remains finite for all cases of interest. In fact, in the limit, the expression becomes

$$\log\frac{f(x_i, y_j)}{f_1(x_i)f_2(y_j)} \tag{8-36}$$

It is certainly reasonable to exclude the degenerate cases corresponding to densities which are not absolutely continuous. (See references given in the footnotes on pages 277 and 295.)

To sum up, in passing from the discrete to the continuous model, each one of the entropies H(Y), H(X), and H(X,Y) leads to the calculation of the logarithm of some infinitesimal probabilities, thus leading to infinite entropies. However, the expression of Eq. (8-32) generally remains finite and will lead to a finite measure for the mutual entropy.*

Invariance of Transinformation under Linear Transformation. Finally, we should like to show that, in contrast with self-informations, our measure of mutual information, Eq. (8-32), remains invariant under all linear-scale transformations at the input and the output of the channel. The proof will follow by applying the general equations of transformation of variables; that is, let

$$X_1 = aX + b \qquad Y_1 = cY + d \tag{8-37}$$

Then

$$H(X_1) = -\int_{-\infty}^{\infty} p_1(x_1)\log p_1(x_1)\,dx_1$$

$$H(Y_1) = -\int_{-\infty}^{\infty} p_2(y_1)\log p_2(y_1)\,dy_1 \tag{8-38}$$

$$H(X_1,Y_1) = -\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} p(x_1,y_1)\log p(x_1,y_1)\,dx_1\,dy_1$$

where $p_1(x_1)$ and $p_2(y_1)$ are the probability density functions associated with the variables $X_1$ and $Y_1$, respectively, and $p(x_1,y_1)$ is their joint density function. But according to Sec. 5-12,

$$p_1(x_1) = \frac{f_1(x)}{|a|} \qquad p_2(y_1) = \frac{f_2(y)}{|c|}$$

$$p(x_1,y_1) = \frac{f(x,y)}{|J(x_1,y_1/x,y)|} = \frac{f(x,y)}{\left|\det\begin{pmatrix} a & 0 \\ 0 & c \end{pmatrix}\right|} = \frac{f(x,y)}{|ac|} \tag{8-39}$$

$$H(X_1) = H(X) + \log|a|$$

$$H(Y_1) = H(Y) + \log|c| \tag{8-40}$$

$$H(X_1,Y_1) = H(X,Y) + \log|ac|$$

Thus

$$I(X_1;Y_1) = H(X_1) + H(Y_1) - H(X_1,Y_1)$$

$$= H(X) + H(Y) - H(X,Y) + \log|a| + \log|c| - \log|ac|$$

$$= I(X;Y) \tag{8-41}$$

While the individual entropies may change under linear transformations, the transinformation remains intact.


* I(X;Y) is finite when all densities are absolutely continuous; the transinformation may become infinite in some extrinsic circumstances (see, for instance, IRE Trans. on Inform. Theory, December, 1956, pp. 102-108, or theorem 1.1 of the Gel'fand and Iaglom reference cited on page 295).




Thus we have removed all three objections by selecting the concept of 
transinformation for the basis of our discussions.* 

The following elementary properties of transinformation I(X;Y) are self-evident:

$$I(X;Y) = I(Y;X) \tag{8-42}$$

$$I(X;Y) \ge 0 \tag{8-43}$$

$$I(X;Y) = 0 \quad \text{if X and Y are independent variables} \tag{8-44}$$

$$H(X) \ge H(X|Y) \tag{8-45}$$

$$H(Y) \ge H(Y|X) \tag{8-46}$$

$$H(X) + H(Y) \ge H(X,Y) \tag{8-47}$$

8-6. Maximization of the Entropy of a Continuous Random Variable.

The maximum entropy of a complete discrete scheme occurs when all the events are equiprobable. This statement is not meaningful in the case of a random variable assuming a continuum of values. In this case, it is quite possible to have entropies which may not be finite. This might be interpreted as a pitfall for the definition of the channel capacity. However, the situation can be improved by assuming some plausible constraint on the nature of the density distributions. For instance, if the random variable has a finite range, then one may ask what type of density distribution leads to the greatest value of entropy. Such questions can be answered by using mathematical maximization techniques from the calculus of variations, such as the method of Lagrange multipliers. We shall employ this method in the subsequent sections for the following three basic constraints.

Case 1. What type of probability density distribution gives maximum entropy when the random variable is bounded by a finite interval, say $a \le X \le b$?

Case 2. Let X assume only nonnegative values, and let the first moment of X be a prespecified number a (a > 0). What probability density distribution leads to the maximum entropy?

Case 3. Given a random variable with a specified second central moment (or a specified standard deviation $\sigma$), determine the probability density distribution that has the maximum entropy.

* This is indeed in accordance with our fundamental frame of reference. To 
measure something, one must have a basis of comparison. The “arithmetical ratio” 
of this comparison gives an indication of the relative measure of the thing that is being 
measured with respect to some adopted unit. 

Similarly, in information theory, it is the difference of some a priori and a posteriori expectation of the system that provides us with a measure of the average gain or loss of knowledge or uncertainty about a system. In problems of information theory, the above “arithmetical ratio” in turn is translated into the difference of two entropies. This is of course due to the use of the logarithmic scale.



By using such techniques, the following answers will be obtained.

Case 1. The maximum entropy is associated with a random variable with a uniform probability density distribution between a and b.

Case 2. The maximum entropy corresponds to an exponential probability density distribution of the form

$$f(x) = \frac{1}{a}e^{-x/a}$$

Case 3. Among the specified class of probability density functions the gaussian distribution

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}e^{-x^2/2\sigma^2}$$

has the largest entropy.
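Case 1 can be illustrated with the two worked examples of this chapter. The sketch below is a small comparison under stated assumptions: the function names are hypothetical, the triangular and quadratic entropies are the closed forms of Examples 8-1 and 8-2 rewritten for an interval of a given width, and the uniform value $\ln(b - a)$ is the one derived in Sec. 8-7. All three densities live on the same finite interval, and the uniform one should come out largest.

```python
import math

def uniform_entropy(width):
    """ln(b - a) for the uniform density on an interval of the given width."""
    return math.log(width)

def triangular_entropy(width):
    """Example 8-1 with h = 2/width: H = 1/2 - ln h."""
    return 0.5 - math.log(2.0 / width)

def quadratic_entropy(width):
    """Example 8-2 with a = width: H = 2/3 + ln(a/3)."""
    return 2.0 / 3.0 + math.log(width / 3.0)

if __name__ == "__main__":
    for width in (0.1, 1.0, 10.0):
        print(f"width = {width:5.1f}  uniform = {uniform_entropy(width):+.4f}  "
              f"triangular = {triangular_entropy(width):+.4f}  "
              f"quadratic = {quadratic_entropy(width):+.4f}")
```

The gaps between the three values do not depend on the width, which is consistent with the scaling rule of Eq. (8-31).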

8-7. Entropy Maximization Problems. Now we shall employ the variational technique for solving cases 1, 2, and 3 of the previous section.

1. One has to maximize

$$H(X) = -\int_a^b f(x)\ln f(x)\,dx \tag{8-48}$$*

subject to the constraint

$$\int_a^b f(x)\,dx = 1$$

Let $\lambda$ be a constant multiplier; the unknown solution $f(x)$ must satisfy

$$-\frac{\partial}{\partial f}(f\ln f) + \lambda\frac{\partial}{\partial f}(f) = 0 \tag{8-49}$$

that is,

$$-1 - \ln f(x) + \lambda = 0$$

$$f(x) = e^{\lambda - 1} \tag{8-50}$$

As $\lambda$ is a constant, this equation shows that the required distribution must be uniformly constant in the interval (a,b). The value of this constant can be found directly.

$$f(x) = \frac{1}{b-a} \qquad a \le x \le b$$

The associated maximal entropy is

$$H(X) = -\int_a^b\frac{1}{b-a}\ln\frac{1}{b-a}\,dx = \ln(b-a) \tag{8-51}$$

For a continuous random variable bounded to a finite interval, the uniform probability density provides the maximum entropy.


* The use of the natural logarithm here is for convenience in algebra. 



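2. When the expected value, that is, the first moment of the continuous random variable $X \ge 0$, is specified as E(X) = a, the unknown function $f(x)$ is subject to the following constraints:

$$H(X) = -\int_0^{\infty} f(x)\ln f(x)\,dx \tag{8-52}$$

$$\int_0^{\infty} f(x)\,dx = 1$$

$$\int_0^{\infty} xf(x)\,dx = a \qquad a > 0$$

Using the method of Lagrangian multipliers, we find

$$\frac{\partial}{\partial f}(-f\ln f) + \mu\frac{\partial}{\partial f}(f) + \lambda\frac{\partial}{\partial f}(xf) = -(1 + \ln f) + \mu + \lambda x = 0 \tag{8-53}$$

$$f(x) = e^{\mu - 1}e^{\lambda x} \tag{8-54}$$

The desired density distribution is of an exponential type. The values of $\lambda$ and $\mu$ can be determined by direct substitution of $f(x)$ in the constraint relations:

$$\int_0^{\infty} e^{\mu - 1}e^{\lambda x}\,dx = 1$$

$$\int_0^{\infty} xe^{\mu - 1}e^{\lambda x}\,dx = a \tag{8-55}$$

Note that $\lambda$ must be negative; otherwise the probability constraint cannot be satisfied. Based on this remark, the above equations yield

$$e^{\mu - 1} = -\lambda = \frac{1}{a} \tag{8-56}$$

$$f(x) = \frac{1}{a}e^{-x/a} \qquad x \ge 0 \tag{8-57}$$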


The extremal entropy has a value of

$$H(X) = -\int_0^{\infty}\frac{1}{a}e^{-x/a}\ln\left(\frac{1}{a}e^{-x/a}\right)dx$$

$$= \ln a\int_0^{\infty}\frac{1}{a}e^{-x/a}\,dx + \frac{1}{a}\int_0^{\infty}\frac{x}{a}e^{-x/a}\,dx$$

$$= \ln a + 1 \tag{8-58}$$

Thus the maximum possible entropy for all continuous random variables with prespecified first moment in $[0,\infty)$ is

$$H(X) = \ln ae$$

The logarithm is here computed to the natural base.

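3. Let $f(x)$ be a one-dimensional probability density function, and let the random variable X have a preassigned standard deviation $\sigma$ and zero mean. Which function $f(x)$ gives the maximum of the entropy H(X)?

Following the outlined procedure of the calculus of variations, one would maximize a linear combination of the constraints through evaluation of the constant multipliers of these constraints. To be specific,

$$\int_{-\infty}^{\infty} f(x)\,dx = 1 \qquad \int_{-\infty}^{\infty} x^2f(x)\,dx = \sigma^2 \tag{8-59}$$

$$H(X) = -\int_{-\infty}^{\infty} f(x)\ln f(x)\,dx \tag{8-60}$$

According to the previously mentioned technique,

$$\frac{\partial}{\partial f}(-f\ln f) + \mu\frac{\partial}{\partial f}(f) + \lambda\frac{\partial}{\partial f}(x^2f) = 0 \tag{8-61}$$

$$-(1 + \ln f) + \mu + \lambda x^2 = 0 \tag{8-62}$$

$$f(x) = e^{\mu - 1}e^{\lambda x^2} \tag{8-63}$$

But

$$\int_{-\infty}^{\infty} e^{\mu - 1}e^{\lambda x^2}\,dx = 1 \qquad \int_{-\infty}^{\infty} x^2e^{\mu - 1}e^{\lambda x^2}\,dx = \sigma^2 \tag{8-64}$$

The latter equations yield

$$\lambda = -\frac{1}{2\sigma^2} \qquad e^{\mu - 1} = \frac{1}{\sqrt{2\pi}\,\sigma} \tag{8-65}$$

Finally,

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}e^{-x^2/2\sigma^2} \tag{8-66}$$

Among all one-dimensional density distributions with prespecified second-order moment (average power), the gaussian (normal) distribution provides the largest entropy. The maximum value of the entropy can be found directly.

$$\ln f(x) = -\ln\sqrt{2\pi}\,\sigma - \frac{x^2}{2\sigma^2} \tag{8-67}$$

$$H(X) = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}\,\sigma}e^{-x^2/2\sigma^2}\ln\sqrt{2\pi}\,\sigma\,dx + \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}\,\sigma}e^{-x^2/2\sigma^2}\frac{x^2}{2\sigma^2}\,dx \tag{8-68}$$

$$H(X) = \ln\sqrt{2\pi}\,\sigma + \frac{1}{2} = \ln\sqrt{2\pi}\,\sigma + \ln\sqrt{e} \tag{8-69}$$

The maximum entropy in natural logarithmic units is

$$H(X) = \ln\left(\sigma\sqrt{2\pi e}\right) \tag{8-70}$$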

The above three maximization problems can be generalized in a direct fashion to the case of multidimensional random variables under similar constraints. For example, it was shown by Shannon that in n-dimensional distributions, when all second-order moments are preassigned and the different variables are mutually independent, the maximum entropy will correspond to an n-dimensional gaussian distribution.
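The results of cases 2 and 3 can be spot-checked by numerical integration. The sketch below is illustrative only: the competing densities (a uniform density with the same mean for case 2, and uniform and Laplace densities with the same standard deviation for case 3), the parameter values, and every function name are assumptions introduced for this example. A spot check against a few competitors does not prove maximality; it only shows the stated extremal values and orderings.

```python
import math

def entropy_nats(density, lo, hi, steps=100_000):
    """Midpoint-rule estimate of -integral of f ln f over [lo, hi], in natural units."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        f = density(lo + (k + 0.5) * dx)
        if f > 0.0:
            total -= f * math.log(f) * dx
    return total

def exponential(a):
    """Eq. (8-57): mean a on the nonnegative axis."""
    return lambda x: math.exp(-x / a) / a if x >= 0.0 else 0.0

def uniform(lo, hi):
    return lambda x: 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def gaussian(sigma):
    """Eq. (8-66): zero mean, standard deviation sigma."""
    return lambda x: math.exp(-x * x / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)

def laplace(sigma):
    """Two-sided exponential with standard deviation sigma (scale sigma / sqrt(2))."""
    b = sigma / math.sqrt(2.0)
    return lambda x: math.exp(-abs(x) / b) / (2.0 * b)

if __name__ == "__main__":
    a = 2.0                                   # prescribed first moment (case 2)
    print("exponential:", entropy_nats(exponential(a), 0.0, 60.0 * a), " 1 + ln a =", 1.0 + math.log(a))
    print("uniform, same mean:", entropy_nats(uniform(0.0, 2.0 * a), 0.0, 2.0 * a))

    s = 1.5                                   # prescribed standard deviation (case 3)
    r = math.sqrt(3.0) * s                    # half-width of the uniform density with variance s^2
    print("gaussian:", entropy_nats(gaussian(s), -12.0 * s, 12.0 * s),
          " ln(s sqrt(2 pi e)) =", math.log(s * math.sqrt(2.0 * math.pi * math.e)))
    print("laplace :", entropy_nats(laplace(s), -40.0 * s, 40.0 * s))
    print("uniform :", entropy_nats(uniform(-r, r), -r, r))
```

The infinite ranges are truncated where the integrands are negligible; the truncation points are part of the assumptions of the sketch.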

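8-8. Gaussian Noisy Channels. As an example of the application of the preceding material, consider a continuous channel where the transmitted and the received signals have a joint gaussian density distribution.

$$f(x,y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}\exp\left[-\frac{1}{2(1-\rho^2)}\left(\frac{x^2}{\sigma_x^2} - \frac{2\rho xy}{\sigma_x\sigma_y} + \frac{y^2}{\sigma_y^2}\right)\right] \qquad |\rho| \ne 1 \tag{8-71}$$

The marginal densities can be obtained directly as

$$f_1(x) = \frac{1}{\sqrt{2\pi}\,\sigma_x}\exp\left(-\frac{x^2}{2\sigma_x^2}\right)$$

$$f_2(y) = \frac{1}{\sqrt{2\pi}\,\sigma_y}\exp\left(-\frac{y^2}{2\sigma_y^2}\right) \tag{8-72}$$

The application of the defining equations for entropies yields

$$H(X) = -\int_{-\infty}^{\infty} f_1(x)\ln f_1(x)\,dx = \ln\sqrt{2\pi e}\,\sigma_x$$

$$H(Y) = -\int_{-\infty}^{\infty} f_2(y)\ln f_2(y)\,dy = \ln\sqrt{2\pi e}\,\sigma_y \tag{8-73}$$

$$H(X,Y) = -\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\ln f(x,y)\,dx\,dy$$

$$= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\ln\left(2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}\right)dx\,dy + \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\frac{1}{2(1-\rho^2)}\left(\frac{x^2}{\sigma_x^2} - \frac{2\rho xy}{\sigma_x\sigma_y} + \frac{y^2}{\sigma_y^2}\right)dx\,dy \tag{8-74}$$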



The double integral of the last equation is the sum of three second-order moments, that is,

$$\frac{1}{2(1-\rho^2)}\left[\frac{E(X^2)}{\sigma_x^2} - \frac{2\rho}{\sigma_x\sigma_y}E(XY) + \frac{E(Y^2)}{\sigma_y^2}\right] = \frac{1}{2(1-\rho^2)}\left(2 - \frac{2\rho\mu_{11}}{\sigma_x\sigma_y}\right) \tag{8-75}$$

where $\mu_{11} = E(XY)$. Now, recall that

$$\rho = \text{correlation coefficient} = \frac{\mu_{11}}{\sigma_x\sigma_y}$$

Thus

$$H(X,Y) = \ln 2\pi\sigma_x\sigma_y\sqrt{1-\rho^2} + \frac{1}{2(1-\rho^2)}(2 - 2\rho^2) = \ln 2\pi\sigma_x\sigma_y e\sqrt{1-\rho^2} \tag{8-76}$$

The mutual information in this channel is

$$I(X;Y) = H(X) + H(Y) - H(X,Y) = \ln\sqrt{2\pi e}\,\sigma_x + \ln\sqrt{2\pi e}\,\sigma_y - \ln 2\pi\sigma_x\sigma_y e\sqrt{1-\rho^2}$$

or

$$I(X;Y) = -\tfrac{1}{2}\ln(1-\rho^2) \qquad |\rho| \ne 1 \tag{8-77}$$

This equation indicates a measure of transinformation for the gaussian channel. The transinformation depends solely on the correlation coefficient between the transmitted and the received signals. When the noise is such that the received signal is independent of the transmitted signal we have

$$\rho = 0 \quad \text{and} \quad I(X;Y) = 0$$

When the magnitude of the correlation coefficient is increased, the mutual information will increase.
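Equations (8-73), (8-76), and (8-77) can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration; the function names and the sample values of $\sigma_x$, $\sigma_y$, and $\rho$ are assumptions. It forms I(X;Y) from the three entropies, compares it with $-\tfrac{1}{2}\ln(1-\rho^2)$, and shows that rescaling the signals changes the individual entropies but not the transinformation.

```python
import math

def gaussian_channel_entropies(sigma_x, sigma_y, rho):
    """H(X), H(Y), H(X,Y) in nats for the joint gaussian density of Eq. (8-71)."""
    if abs(rho) >= 1.0:
        raise ValueError("the joint density of Eq. (8-71) requires |rho| < 1")
    h_x = math.log(math.sqrt(2.0 * math.pi * math.e) * sigma_x)
    h_y = math.log(math.sqrt(2.0 * math.pi * math.e) * sigma_y)
    h_xy = math.log(2.0 * math.pi * math.e * sigma_x * sigma_y * math.sqrt(1.0 - rho * rho))
    return h_x, h_y, h_xy

def transinformation_from_entropies(sigma_x, sigma_y, rho):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    h_x, h_y, h_xy = gaussian_channel_entropies(sigma_x, sigma_y, rho)
    return h_x + h_y - h_xy

def transinformation_closed_form(rho):
    """Eq. (8-77)."""
    return -0.5 * math.log(1.0 - rho * rho)

if __name__ == "__main__":
    for rho in (0.0, 0.3, 0.6, 0.9, -0.9):
        print(f"rho = {rho:+.1f}  from entropies = "
              f"{transinformation_from_entropies(1.0, 2.0, rho):.6f}  "
              f"closed form = {transinformation_closed_form(rho):.6f}")
    # Rescaling the input and output leaves the transinformation unchanged.
    print(transinformation_from_entropies(3.0, 0.5, 0.6))
```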

8-9. Transmission of Information in Presence of Additive Noise. The rate of transmission of information in a channel may be defined as the mean or the expected value of the function:

$$I = \log\frac{f(X,Y)}{f_1(X)f_2(Y)} \tag{8-78}$$

That is,

$$E(I) = I(X;Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\log\frac{f(x,y)}{f_1(x)f_2(y)}\,dx\,dy \tag{8-79}$$

The rate of transmission in bits provides a measure for the average information processed in the channel in the sense described. The maximum of I(X;Y) with respect to all possible input probability densities, but under some additional constraints, leads to the concept of channel capacity in continuous channels. In other words, in contrast with the discrete case, the channel capacity in the continuous case is not an absolute quantity but depends on the constraint. The evaluation of the channel capacity is generally a difficult problem, and no general method can be given to cover all circumstances. However, in certain special cases we are able to study the rate of transmission and evaluate its maximum. One such case which is also of much practical significance is the channel with additive noise.



Fig. 8-3. An illustration of the performance of a continuous channel in the presence of additive noise.

Let X be the random variable describing the transmitted signal and Y the received signal. We assume that the noise in the channel is additive and statistically independent of X:

$$Y = X + Z$$

$$f_y(z + x|x) = \phi(z) \tag{8-80}$$

where Z is a random variable with a probability density function $\phi(z)$. Equation (8-80) suggests some simplification in the relations among different density functions associated with X and Y. In fact, reference to Fig. 8-3 suggests that

$$P\{z_0 < Z < z_0 + dz_0\} = \phi(z_0)\,dz_0 \tag{8-81}$$

$$P\{y_0 < Y < y_0 + dy_0 | X = x_0\} = P\{z_0 < Z < z_0 + dz_0 | X = x_0\} \tag{8-82}$$

This conditional probability function is independent of $x_0$ and depends only on the noise structure. Therefore,

$$P\{y_0 < Y < y_0 + dy_0 | X = x_0\} = P\{z_0 < Z < z_0 + dz_0\} = \phi(z_0)\,dz_0 \tag{8-83}$$



The conditional probability density $f_y(y|x)$ and the noise density function $\phi(z)$ are of identical structures. That is, in our familiar notation,

$$f_y(y|x) = f_y(x + z|x) = \phi(z) \tag{8-84}$$

The identical nature of the two probability functions here suggests identical entropies:

$$H(Y|X) = H(X + Z|X) = H(Z) \tag{8-85}$$

Thus the transinformation becomes

$$I(X;Y) = H(Y) - H(Y|X) = H(Y) - H(Z) = H(Y) + \int_{-\infty}^{\infty}\phi(z)\log\phi(z)\,dz \tag{8-86}$$

(For a schematic presentation, see Fig. 8-4.)
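A small non-gaussian case shows how Eq. (8-86) is used. The sketch below rests on assumptions made only for illustration: the input X and the noise Z are both taken as uniform on [0,1] and independent, so that the density of Y = X + Z is the triangle on [0,2] obtained by convolving the two uniform densities; the function names are hypothetical. The transinformation then follows as H(Y) - H(Z).

```python
import math

def entropy_nats(density, lo, hi, steps=100_000):
    """Midpoint-rule estimate of -integral of f ln f over [lo, hi], in natural units."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        f = density(lo + (k + 0.5) * dx)
        if f > 0.0:
            total -= f * math.log(f) * dx
    return total

def noise_density(z):
    """Assumed noise density phi(z): uniform on [0, 1]."""
    return 1.0 if 0.0 <= z <= 1.0 else 0.0

def output_density(y):
    """Density of Y = X + Z for independent X, Z uniform on [0, 1] (their convolution)."""
    if 0.0 <= y <= 1.0:
        return y
    if 1.0 < y <= 2.0:
        return 2.0 - y
    return 0.0

def additive_channel_transinformation():
    """I(X;Y) = H(Y) - H(Z), Eq. (8-86), for the assumed uniform input and noise."""
    h_y = entropy_nats(output_density, 0.0, 2.0)
    h_z = entropy_nats(noise_density, 0.0, 1.0)
    return h_y - h_z

if __name__ == "__main__":
    print("I(X;Y) in nats:", additive_channel_transinformation())
```

Here H(Z) vanishes and H(Y) is the triangular entropy of Example 8-1 with h = 1, so the result should be close to one-half of a natural unit.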

The channel capacity can be determined by finding the maximum of I(X;Y) in Eq. (8-86) with respect to all possible probability density functions $f_1(x)$ subject to certain required constraints. The most common types of constraint are those limiting the peak or the average value of signal power at the transmitter (see Sec. 8-6). Such problems are generally tedious and require lengthy treatment. However, with appropriate further assumptions the problem may be simplified. An example of such a simplification is given in the next section.

8-10. Channel Capacity in Presence of Gaussian Additive Noise and Specified Transmitter and Noise Average Power. The following constraints are required:

$$\phi(z) = \frac{1}{\sqrt{2\pi}\,\sigma_z}e^{-z^2/2\sigma_z^2} \tag{8-87}$$

$$\int_{-\infty}^{\infty} f_1(x)\,dx = 1 \qquad \int_{-\infty}^{\infty} xf_1(x)\,dx = 0 \qquad \int_{-\infty}^{\infty} x^2f_1(x)\,dx = \sigma_x^2 \tag{8-88}$$

Equation (8-87) corresponds to the assumption that the noise is normally distributed with zero mean and average power $\sigma_z^2$. Equation (8-88) requires the transmitted signal to have zero mean and specified power $\sigma_x^2$ but is otherwise unrestricted. (The constraints on the mean of X and Z are included for convenience but they are not essential in the subsequent developments.)

Fig. 8-4. A set-theory interpretation of the diagram suggests I(X;Y) = H(Y) - H(Z). (All entropies referred to are assumed to be finite and positive.)

According to Eq. (8-86), we have

$$I(X;Y) = H(Y) - H(Z) = H(Y) - \tfrac{1}{2}\ln 2\pi e\sigma_z^2 \tag{8-89}$$

The additive structure of noise in the channel provides us with a simpler means of computing the channel capacity, namely,

$$\max I(X;Y) = \max\,[H(Y) - H(Z)] \tag{8-90}$$

Observe that I(X;Y) in the above form is only indirectly dependent on $f_1(x)$, the probability density at the input. The channel capacity may be computed by maximizing H(Y) under the previously mentioned constraints. Let us first compute the mean and the standard deviation of the signal at the output.

$$E(Y) = E(X + Z) = E(X) + E(Z) = 0 \tag{8-91}$$

$$E(Y^2) = E(X + Z)^2 = E(X^2) + E(Z^2) = \sigma_x^2 + \sigma_z^2 = \text{const} \tag{8-92}$$

The problem of finding the channel capacity is now reduced to finding a probability density distribution having the following prespecified mean and average power, respectively:

$$0, \qquad \sigma_x^2 + \sigma_z^2$$

It was proved in Sec. 8-7 that, when the standard deviation is preassigned, the density distribution with the largest entropy is a normal distribution with zero mean [Eq. (8-66)]. The maximum obtainable entropy for H(Y) in the considered class of input distributions is

$$H(Y) = \tfrac{1}{2}\ln 2\pi e(\sigma_x^2 + \sigma_z^2) \tag{8-93}$$

Thus the channel capacity is

$$C = \max I(X;Y) = \max\,[H(Y) - \tfrac{1}{2}\ln 2\pi e\sigma_z^2] \tag{8-94}$$

$$C = \tfrac{1}{2}\ln\frac{\sigma_x^2 + \sigma_z^2}{\sigma_z^2}$$

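Finally, letting*

$$S = \sigma_x^2 \qquad N = \sigma_z^2$$

yields

$$C = \tfrac{1}{2}\ln\left(1 + \frac{S}{N}\right) \tag{8-95}$$

* If X is a random voltage applied to a unit resistor and

$$\overline{X} = 0 \qquad \sigma_x^2 = S$$

then the expected value of the instantaneous power delivered to the resistor is

$$E(X^2) = \sigma_x^2 = S$$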

This remarkably simple result (due to Shannon) gives the channel capacity as one-half of the logarithm of the ratio of the average signal power at the output to noise power. Note that Y is the sum of two independent random variables X and Z. Since Z and Y are normally distributed, one can show that X will also be normally distributed. To sum up, under additive gaussian noise and average power limitation on noise and on the output signal, the input and the output signals both must have normal distributions in order to give the highest rate of transmission of information. These considerations can be generalized to the case where the input and noise are multidimensional random variables. This will be discussed later.
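The capacity formula of Eq. (8-95) can be tied back to the earlier results with a short computation. The sketch below uses hypothetical function names and sample power values chosen for illustration. It evaluates C, confirms that it equals H(Y) - H(Z) for gaussian entropies with variances S + N and N, and compares it with Eq. (8-77), using the fact that for Y = X + Z with independent X and Z the squared correlation coefficient between X and Y is S/(S + N).

```python
import math

def gaussian_entropy_nats(variance):
    """Entropy of a gaussian density with the given variance, Eq. (8-70)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)

def capacity_nats(signal_power, noise_power):
    """C = (1/2) ln(1 + S/N), Eq. (8-95)."""
    return 0.5 * math.log(1.0 + signal_power / noise_power)

def nats_to_bits(value):
    return value / math.log(2.0)

if __name__ == "__main__":
    for s, n in [(1.0, 1.0), (10.0, 1.0), (100.0, 1.0)]:
        c = capacity_nats(s, n)
        via_entropies = gaussian_entropy_nats(s + n) - gaussian_entropy_nats(n)
        rho_squared = s / (s + n)
        via_correlation = -0.5 * math.log(1.0 - rho_squared)
        print(f"S/N = {s / n:6.1f}  C = {c:.4f} nats = {nats_to_bits(c):.4f} bits  "
              f"H(Y) - H(Z) = {via_entropies:.4f}  Eq. (8-77) = {via_correlation:.4f}")
```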

8-11. Relation between the Entropies of Two Related Random Variables. In many physical problems we deal with systems whose input-output relationships are specified. In problems of probabilistic origin, it is often of interest to investigate the entropy of the output in relation to the entropy of the input. Consider the case where the output is a monotonic function of the input:

$$Y = g(X)$$

Fig. 8-5. An example of the transformation of random variables.

According to Sec. 5-11, the probability density $p(y)$ is given by

$$p(y) = f(x)\left|\frac{dx}{dy}\right| = f(\psi(y))\,|\psi'(y)|$$

where the unique function $\psi$ stands for the inverse relationship between input and output, that is,

$$X = \psi(Y)$$

The entropy of the random variable Y describing the output of a transducer is found to be

$$H(Y) = -\int_{-\infty}^{\infty} p(y)\log p(y)\,dy$$

$$= -\int_{-\infty}^{\infty} f(\psi(y))\,|\psi'(y)|\log\left[f(\psi(y))\,|\psi'(y)|\right]dy$$

$$= E\left[-\log f(X)\left|J\!\left(\frac{X}{Y}\right)\right|\right] \tag{8-96}$$

$J(x/y)$ stands for the so-called Jacobian of x with respect to y. The capital letters, as usual, denote random variables. Finally one finds

$$H(Y) = H(X) - E\left[\log\left|J\!\left(\frac{X}{Y}\right)\right|\right] \tag{8-97}$$



Thus the entropy at the output is equal to the entropy at the input less the average value of the logarithm of the absolute value of the Jacobian of the input with respect to the output. The above simple procedure may be applied in a direct fashion to show that the transinformation remains invariant under some quite general transformations. In fact, let X be subjected to a monotonic transformation $Z = g(X)$; according to the material of Sec. 5-12, we have

$$H(Z) = H(X) + \int f_1(x)\log\left|\frac{dz}{dx}\right|dx$$

$$H(Z|y_0) = H(X|y_0) + \int f_x(x|y_0)\log\left|\frac{dz}{dx}\right|dx$$

Note that

$$H(Z|Y) = H(X|Y) + \int f_2(y_0)\,dy_0\int f_x(x|y_0)\log\left|\frac{dz}{dx}\right|dx$$

$$= H(X|Y) + \iint f(x,y_0)\log\left|\frac{dz}{dx}\right|dx\,dy_0$$

After the necessary calculations, one finds that

$$H(Z) - H(Z|Y) = H(X) - H(X|Y)$$

The procedure may be extended to the case of a transformation of two or more random variables.

An analogous situation holds if the transducer is a multiport, that is, X and Y are finite random vectors (multidimensional random variables). Under the conditions described in Sec. 5-11, we are able to compute the probability density function of the output. Using the Jacobian symbol, the result of Eq. (5-76) can be rewritten as

$$p(y_1, y_2, \ldots, y_n) = f(x_1, x_2, \ldots, x_n)\left|J\!\left(\frac{x_1, x_2, \ldots, x_n}{y_1, y_2, \ldots, y_n}\right)\right| \tag{8-98}$$

The output entropy for this multiport is*

$$H(Y_1, Y_2, \ldots, Y_n) = -E[\log p(Y_1, Y_2, \ldots, Y_n)]$$

$$= H(X_1, X_2, \ldots, X_n) - E\left[\log\left|J\!\left(\frac{X_1, X_2, \ldots, X_n}{Y_1, Y_2, \ldots, Y_n}\right)\right|\right] \tag{8-99}$$

In the particular case when the transducer achieves a simple linear algebraic operation, that is,

$$[Y] = [A][X]$$

* The entropy of a multidimensional random variable will be discussed shortly in more detail. We trust that the reader will not be inconvenienced by the injection of this paragraph slightly ahead of schedule.



CONTINUOUS CHANNEL WITHOUT MEMORY 289 

the term \ J{Y /X.)\ is equal to the absolute value of the determinant A of 
the transformation, and the Jacobian J{X/Y) is equal to 1/|A| since 

I (?) I ^ "" 1 

In this case the relation between the two entropies assumes the following 
simpler form: 

//(Fi,F2, . . . ,Fn) = /f(A"i,X 2 , . . . ,Z„) + log |A| (8-101) 

An example of the application of this formula occurs when the entropy 
of a random vector is known and a linear transformation of coordinate 
axes takes place. For instance, for all distance-preserving transforma- 
tions, such as the rotation of a coordinate system, the entropy of the 
input remains unchanged. The extension of the above to the case of a 
general linear system where input and output are related by a linear 
differential or integral equation reejuires much care. For such systems, 
the validity of many of the known results should be more closely 
examined. 
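As a numerical illustration of Eq. (8-101), the following Python sketch compares the entropy of a two-dimensional gaussian vector before and after a linear transformation. It assumes a gaussian input, for which the entropy has the closed form $\frac{1}{2} \log [(2\pi e)^n \det \Sigma]$ in natural units; the helper names, the covariance matrix, and the matrices of the transformation are illustrative choices, not taken from the text.

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose2(m):
    """Transpose of a 2x2 matrix."""
    return [[m[j][i] for j in range(2)] for i in range(2)]

def gaussian_entropy(cov):
    """Entropy in nats of a zero-mean 2-D gaussian with covariance cov."""
    return 0.5 * math.log((2.0 * math.pi * math.e) ** 2 * det2(cov))

def output_covariance(a, cov_x):
    """Covariance of Y = A X, that is, A cov_x A^t."""
    return matmul2(matmul2(a, cov_x), transpose2(a))

# Assumed input covariance and transformation matrices (illustrative values).
cov_x = [[2.0, 0.5], [0.5, 1.0]]
a = [[3.0, 1.0], [0.0, 2.0]]            # determinant 6
theta = 0.7                              # rotation angle in radians
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]

h_x = gaussian_entropy(cov_x)
h_y = gaussian_entropy(output_covariance(a, cov_x))
h_rot = gaussian_entropy(output_covariance(rotation, cov_x))

# The entropy shift should equal log|A|; a rotation should leave it unchanged.
print(h_y - h_x, math.log(abs(det2(a))))
print(h_rot - h_x)
```

The first printed pair should agree, and the second printed value should be zero up to rounding, in line with the remark on distance-preserving transformations.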

8-12. Note on the Definition of Mutual Information. A more precise 
definition of the communication entropy of a source and channel has been 
subsequently used by the Russian mathematicians A. N. Kolmogorov, 
A. M. Iaglom, and I. M. Gel'fand.* While this definition has a more 
fundamental appeal than the one discussed earlier in this chapter, it relies on 
more complex mathematical tools, not generally available in engineering 
courses. For the sake of reference this definition is included here. 
However, its understanding is not mandatory in an introductory presentation. 

Let $\xi$ and $\eta$ be random objects (vectors, functions, or even generalized 
functions) defined over X and Y with appropriate probability 
distributions (not necessarily continuous densities). 

$$P_\xi(A) = P\{\xi \in A\}$$

$$P_\eta(B) = P\{\eta \in B\}$$ (8-102)

$$P_{\xi\eta}(C) = P\{(\xi,\eta) \in C\}$$

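The quantity of information in the random object $\xi$ relative to $\eta$ is defined 
as 

$$I(\xi;\eta) = \int_X \int_Y P_{\xi\eta}(dx\,dy) \log \frac{P_{\xi\eta}(dx\,dy)}{P_\xi(dx)\,P_\eta(dy)}$$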

* The above presentation is based on A. N. Kolmogorov, On the Shannon Theory of 
Information Transmission in the Case of Continuous Signals, IRE Trans. on Inform. 
Theory, vol. IT-2, pp. 102-108, December, 1956. See also A. N. Kolmogorov, 
A. M. Iaglom, and I. M. Gel'fand, "Quantity of Information and Entropy for 
Continuous Distributions," report given at Third All-Union Mathematics Conference, 
1956. 





The mathematical superiority of this formula to the one originally 
employed by Shannon lies in the fact that it is of a much more general 
nature. The following properties for the mutual information function 
can be established when $\xi$ and $\eta$ are random objects of a fairly general 
nature. 

$$I(\xi;\eta) \geq 0$$ (8-104)

Equality holds if, and only if, $\xi$ and $\eta$ are independent objects. If 
$(\xi_1,\eta_1)$ and $(\xi_2,\eta_2)$ are two independent pairs, then 

$$I((\xi_1,\xi_2);(\eta_1,\eta_2)) = I(\xi_1;\eta_1) + I(\xi_2;\eta_2)$$

$I((\xi,\eta);\zeta) = I(\eta;\zeta)$ if, and only if, the conditional distribution of $\zeta$ depends 
only on $\eta$ for a fixed $\xi$ and $\eta$. 


PROBLEMS 

8-1. A source transmits pulses of constant duration but different heights. The 
height of a pulse X varies between $a_1$ and $a_2$ volts. The source is connected to a 
channel; at the receiving end the height of the pulse can be considered as a random 
variable Y varying between $b_1$ and $b_2$ volts. The joint probability density of X and Y 
is 

/„■) 

(a) Determine the source entropy $H(X)$. 

(b) Determine the entropy $H(Y)$. 

(c) Determine the entropy $H(X,Y)$. 

(d) Determine the transinformation $I(X;Y)$ and discuss the results. 

8-2. X is a random variable uniformly distributed between +1 and −1. 

(a) Find the means, the standard deviations, and the correlation coefficients 
between the following variables: ( 1 ) A'' and A'* and ( 2 ) A and A'^. 

(b) Calculate the entropy associated with each individual random variable. 

8-3. Consider a continuous communication system where the joint probability 
density is described as 

a>fc >0 

= 0 elsewhere 

(a) Determine the source entropy. 

(b) Determine the equivocation entropy. 

(c) Determine the entropy at the receiver. 

(d) Determine the transinformation. 

8-4. Let X and Y, the cartesian coordinates of a point M, be independent random 
variables uniformly distributed in [0,1]. 

(a) Study the marginal and the joint distribution of 

$$R = (X^2 + Y^2)^{1/2} \qquad \text{and} \qquad \Phi = \tan^{-1} \frac{Y}{X}$$

(b) Determine the different entropies for the communication model $(R,\Phi)$. 

(c) Evaluate the transinformation. 

8-5. A continuous channel has the following characteristics: 

/0/lx) = 

The input to the channel is a random voltage with density 

/i(j) = — ^ ,ze 
2flf x/tt 

X and Y may assume any values from $-\infty$ to $+\infty$. Determine 
(a) The entropy of the source 
(b) The conditional entropy of the channel 
(c) The transinformation 

8-6. Answer the same questions as in Prob. 8-5 for the sources and the channels 
described below: 


(a) 

II 

1 

0 < T < 1 


. ()i/C2 — JT - 

■ 0 < 7 < 1 

(f>) 

/i(.r) = r ^ 

0 < .r < X 


/(//Ir) = rc 

0 < // < X 

(c) 

/(X,//) = 1 < 

1 ^ ^ 

2* < DO - < y < X 

X - 



CHAPTER 9 


TRANSMISSION OF BAND-LIMITED SIGNALS 


9-1. Introduction. A major aim of this chapter is to derive and discuss 
the Shannon-Hartley fundamental channel-capacity formula for band-limited 
time functions. This formula states that under certain plausible 
conditions the maximum rate of transmission of information for band-limited 
signals perturbed by independent gaussian noise, when the signal 
and the noise power are limited to S and N, respectively, is 

$$C_t = W \log \left(1 + \frac{S}{N}\right)$$ bits per second (9-1)

where $(-2\pi W, +2\pi W)$ specifies the frequency range of the class of 
band-limited signals under consideration. Since the class of band-limited 
signals constitutes the most important class of signals applied to any com- 
munication apparatus, the above equation forms a central theme for the 
study of optimum performance of communication devices transmitting 
continuous messages. While this well-known formula is astonishingly 
simple and intuitively could be accepted as a direct extension of Eq. 
(8-95), its derivation is rather complicated. Actual derivation of 
Eq. (9-1) is based on a number of assumptions which must be carefully 
examined and some mathematical developments which require detailed 
attention. While the information-theory content of this chapter will be 
primarily devoted to the derivation and physical interpretation of this 
basic formula, we shall digress and present some basic relevant mathe- 
matical techniques. Adequate acquaintance with these techniques will 
enable the reader to broaden his view and be prepared for undertaking 
similar problems. 
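To give a feeling for the orders of magnitude involved in Eq. (9-1), the following Python sketch evaluates the formula for a few values of the signal-to-noise ratio. It assumes that the logarithm is taken to the base 2 (so that the result is in bits per second) and that W is expressed in cycles per second; the function name and the numerical values are illustrative only.

```python
import math

def shannon_hartley_capacity(bandwidth, signal_power, noise_power):
    """C_t = W log2(1 + S/N) in bits per second, following Eq. (9-1)."""
    if bandwidth <= 0 or noise_power <= 0 or signal_power < 0:
        raise ValueError("bandwidth and noise power must be positive, "
                         "signal power must be nonnegative")
    return bandwidth * math.log2(1.0 + signal_power / noise_power)

# Illustrative values: a 3,000-cycle channel at several power ratios S/N.
bandwidth = 3000.0
capacity_equal_powers = shannon_hartley_capacity(bandwidth, 1.0, 1.0)
capacity_high_snr = shannon_hartley_capacity(bandwidth, 1000.0, 1.0)

for ratio in (1.0, 10.0, 100.0, 1000.0):
    c = shannon_hartley_capacity(bandwidth, ratio, 1.0)
    print(f"S/N = {ratio:7.1f}   C_t = {c:10.1f} bits per second")
```

When the signal and the noise power are equal, the formula reduces to $C_t = W$; beyond that point the capacity grows only logarithmically with the signal power.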

Before proceeding, it is worthwhile to organize our thoughts by making 
a sketch of the development to come. 

1. In Sec. 9-2 we continue our study of continuous channels without 
memory, when the input is a multidimensional random variable. 

2. Section 9-3 presents the maximum rate of transmission of informa- 
tion for a class of multidimensional random variables perturbed by 
independent gaussian noise under certain power constraints. 

3. Sections 9-4 to 9-7 are devoted to building a bridge for transition 
from a class of continuous signals to a class of multidimensional random 



variables. The ensemble of continuous signals forms what is called a 
stochastic process. There are generally an infinite number of random 
variables involved in such signals. We therefore have to devise certain 
mathematical techniques enabling us to reduce a problem of such com- 
plexity to a problem dealing with finite multidimensional random varia- 
bles. These sections present the plausible assumptions under which Eq. 
(9-1) holds. Thus we have transformed the problem of communication 
of band-limited continuous signals over noisy channels into the study of 
entropies associated with a multidimensional random variable. 

4. Section 9-8 is somewhat of a digression. There we present the ele- 
ments of an important mathematical tool, namely, the theory of normed 
vector spaces. In the long run, the reader will realize the impact of this 
important concept in many communication problems. As an immediate 
application of the idea of vector spaces, a presentation of the familiar 
Fourier series and sampling theorem will be given. The patient reader 
will be rewarded later with a deeper understanding of Shannon’s geometric 
model of the encoding of continuous messages. 

5. In Sec. 9-13 some possible encoding procedures will be discussed for a 
geometric model of communication of continuous messages. It will be 
shown that under favorable circumstances one may transmit at a rate 
arbitrarily close to $C_t$ as described in Eq. (9-1). 

9-2. Entropies of Continuous Multivariate Distributions. The concept 
of the entropy of a continuous single variate was presented in some detail 
in Chap. 8. This concept can be directly generalized for defining the 
entropy associated with a multidimensional continuous random variable. 
For instance, let X stand for an n-dimensional multivariate 

$$X = [X_1, X_2, \ldots, X_n]$$

with a probability density function 

$$f_1(x_1, x_2, \ldots, x_n)$$

The associated entropy is defined as 

$$H(X_1,X_2,\ldots,X_n) = E[-\log f_1(X_1,X_2,\ldots,X_n)]$$

$$= -\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_1(x_1,x_2,\ldots,x_n) \log f_1(x_1,x_2,\ldots,x_n)\,dx_1\,dx_2 \cdots dx_n$$ (9-2)

In order to save space, we may write this equation symbolically as 

$$H(X) = E[-\log f_1(X)]$$ (9-2a)
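The expectation form of Eq. (9-2a) lends itself to a direct numerical check. The sketch below estimates $E[-\log f_1(X)]$ by averaging over random samples and compares the estimate with a closed form. It assumes, purely for illustration, a three-dimensional gaussian variate with independent zero-mean components, for which the entropy in natural units is the sum of the terms $\log (\sqrt{2\pi e}\,\sigma_k)$; the function names, the standard deviations, the sample size, and the seed are arbitrary choices.

```python
import math
import random

def gaussian_log_density(x, sigmas):
    """log f_1(x) for independent zero-mean gaussian components."""
    total = 0.0
    for x_k, s_k in zip(x, sigmas):
        total += -0.5 * math.log(2.0 * math.pi * s_k ** 2)
        total += -x_k ** 2 / (2.0 * s_k ** 2)
    return total

def entropy_closed_form(sigmas):
    """Sum of log(sqrt(2 pi e) sigma_k), in nats."""
    return sum(math.log(math.sqrt(2.0 * math.pi * math.e) * s) for s in sigmas)

def entropy_monte_carlo(sigmas, n_samples, seed=0):
    """Sample average of -log f_1(X), an estimate of E[-log f_1(X)]."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, s) for s in sigmas]
        acc -= gaussian_log_density(x, sigmas)
    return acc / n_samples

sigmas = [0.5, 1.0, 2.0]                 # assumed standard deviations
exact = entropy_closed_form(sigmas)
estimate = entropy_monte_carlo(sigmas, 100_000)
print(exact, estimate)
```

The two printed numbers should agree to within the sampling error of the average, which shrinks as the number of samples grows.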

The meaning and the properties of $H(X)$ are similar to those described in 
Chap. 8. In a similar fashion we may describe the transinformation in a 
multidimensional continuous channel without memory. For instance, 
assume the output of the channel to be an m-dimensional random variable 

$$Y = [Y_1, Y_2, \ldots, Y_m]$$

with corresponding density function 

$$f_2(y_1, y_2, \ldots, y_m)$$

Then the entropy at the output can be obtained as 

$$H(Y_1,Y_2,\ldots,Y_m) = E[-\log f_2(Y_1,Y_2,\ldots,Y_m)]$$ (9-3)

or, symbolically, 

$$H(Y) = E[-\log f_2(Y)]$$

The noise characteristic of the channel and the joint probability density 
function can be similarly defined as an extension of the same concepts 
in the two-dimensional case: 

$$P\{y_1 < Y_1 < y_1 + dy_1, \ldots, y_m < Y_m < y_m + dy_m \mid X_1 = x_1, \ldots, X_n = x_n\} = f(y|x)\,dy$$

$$P\{y_1 < Y_1 < y_1 + dy_1, \ldots, y_m < Y_m < y_m + dy_m \,\cap\, x_1 < X_1 < x_1 + dx_1, \ldots, x_n < X_n < x_n + dx_n\} = f(x,y)\,dx\,dy$$

The physical interpretation of the situation is quite simple. Suppose 
that, say, the heights and the ages of a group of people are being 
communicated over a noisy channel. At the input we know the two-dimensional 
probability density of the heights and the ages of the group. Then 
the input-output joint density is a function of four variables 
$f(x_1,x_2;\,y_1,y_2)$. The following relations are self-explanatory: 

$$f_1(x_1,x_2) = \int_{y_1=-\infty}^{\infty} \int_{y_2=-\infty}^{\infty} f(x_1,x_2;\,y_1,y_2)\,dy_1\,dy_2$$

$$f_2(y_1,y_2) = \int_{x_1=-\infty}^{\infty} \int_{x_2=-\infty}^{\infty} f(x_1,x_2;\,y_1,y_2)\,dx_1\,dx_2$$

Each set of transmitted data is independent of the previously transmitted 
set (no memory involved); however, $x_1$ and $x_2$ of each particular set 
may be interdependent. In general, if the noise effect on a particular 
sequence, say the kth $[x_1,x_2,\ldots,x_n]$, is independent of the noise effect 
on any other transmitted sequence, say the jth $[x_1,x_2,\ldots,x_n]$, then we 
say that the transmitter has no memory. In the opposite case the 
transmitter exhibits a memory. The mathematical study of channels with 
memory is rather involved. In Chap. 11 we shall discuss briefly discrete 
channels with memory, but presently we shall confine ourselves to sources 
and channels without memory. In the light of the previous discussion, 
the transinformation in channels without memory can be symbolically 

written as 

$$I(X;Y) = E\left[\log \frac{f(X,Y)}{f_1(X)\,f_2(Y)}\right]$$ (9-4)

It is possible to derive formulas for channel capacity under certain 
constraints similar to those discussed in Chap. 8. A particular case of 
such constraints is discussed in the following section. 

9-3. Mutual Information of Two Gaussian Random Vectors.* In 
many problems of the physical world we are interested in investigating 
mutual information of two complex random phenomena. Each of these 
phenomena may be expressed by a random vector, that is, an n variate. 
In this section, we investigate the simplest and perhaps most frequent 
case when two random vectors are normally distributed. The following 
derivation of the mutual information conveyed by a multidimensional 
gaussian random variable about another such variable is due to Gel'fand 
and Iaglom.† Let $X = [X_1,X_2,\ldots,X_n]$ and $Y = [Y_1,Y_2,\ldots,Y_m]$ 
be n- and m-dimensional normal random variables (random vectors), 
respectively. Let Z be the random vector describing their joint behavior. 

$$Z = [X;Y] = [X_1,X_2,\ldots,X_n;\,Y_1,Y_2,\ldots,Y_m]$$

Without loss of generality, we assume 

$$\bar{X}_k = 0 \qquad k = 1, 2, \ldots, n$$

$$\bar{Y}_k = 0 \qquad k = 1, 2, \ldots, m$$

According to Sec. 7-2, the n-dimensional normal probability density of X can 
be written as 

$$f_1(x) = \frac{1}{(2\pi)^{n/2} (\det A)^{1/2}} \exp \left[-\tfrac{1}{2}(A^{-1}x,x)\right]$$ (9-5a)

where [A] is the moment matrix; that is, its elements $a_{ij}$ are defined as 

$$a_{ij} = \text{moment of } X_i X_j$$

This may be written symbolically as 

$$a_{ij} = \int x_i x_j f(z)\,dz = \int x_i x_j f_1(x)\,dx$$

$f_1(x)$ symbolically refers to the n-dimensional probability density function of X. 

* See S. Kullback, "Information Theory and Statistics," chap. 9, John Wiley & 
Sons, Inc., New York, 1959. The proof of this section may be skipped in a first 
reading. The proof of the statement, although basically simple, assumes familiarity 
with quadratic functions, the partitioning of a multinormal variate into two sets, 
and the determination of their partitioned covariance matrix. 

† I. M. Gel'fand and A. M. Iaglom, Calculation of the Amount of Information 
about a Random Function Contained in Another Such Function, Uspekhi Mat. Nauk 
S.S.S.R., new series, vol. 12, 1957. English translation in Trans. Am. Math. Soc., 
ser. 2, vol. 12, pp. 199-246, 1959. See also A. N. Kolmogorov, On the Shannon 
Theory of Information Transmission in the Case of Continuous Signals, IRE Trans. 
on Inform. Theory, vol. IT-2, pp. 102-108, December, 1956. 





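Similarly, for the distributions of Y and Z we have, respectively, 

$$f_2(y) = \frac{1}{(2\pi)^{m/2} (\det B)^{1/2}} \exp \left[-\tfrac{1}{2}(B^{-1}y,y)\right]$$ (9-5b)

$$f(z) = \frac{1}{(2\pi)^{(n+m)/2} (\det C)^{1/2}} \exp \left[-\tfrac{1}{2}(C^{-1}z,z)\right]$$ (9-5c)

where the elements of [B] and [C] are defined, respectively, as: 

$$b_{ij} = \int y_i y_j f_2(y)\,dy$$

$$c_{ij} = \int z_i z_j f(z)\,dz$$

The moment matrix [C] for the joint gaussian distribution is found to be 

$$[C] = \begin{bmatrix} A & D \\ D^t & B \end{bmatrix}$$

where t stands for matrix transposition and the elements of [D] are 
defined as 

$$d_{ij} = \int x_i y_j f(z)\,dz$$

(A, B, and D are assumed to be nonsingular matrices.) 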

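In order to compute the mutual information between the two random 
vectors one has to employ Eq. (9-4). To this effect, we write symbolically 

$$I(X;Y) = \int f(z) \log \frac{f(z)}{f_1(x) f_2(y)}\,dz$$

$$= \int \left\{ \frac{1}{2} \log \frac{\det A \, \det B}{\det C} - \frac{1}{2} \left[ (C^{-1}z,z) - (A^{-1}x,x) - (B^{-1}y,y) \right] \right\} f(z)\,dz$$ (9-6)

But note that 

$$\int (A^{-1}x,x) f(z)\,dz = \int (A^{-1}x,x) f_1(x)\,dx = n$$

$$\int (B^{-1}y,y) f(z)\,dz = m$$ (9-7)

$$\int (C^{-1}z,z) f(z)\,dz = n + m$$

Finally the mutual information becomes 

$$I(X;Y) = \frac{1}{2} \log \frac{\det A \, \det B}{\det C}$$ (9-8)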

This compact formula could be of considerable use in application problems. 

The mutual information of two gaussian random vectors can alternatively 
be expressed in terms of their correlation coefficients. 

$$I(X;Y) = -\frac{1}{2} \log \left[ (1 - \rho_1^2)(1 - \rho_2^2) \cdots (1 - \rho_l^2) \right]$$ (9-8a)

where $\rho_j$ is the correlation coefficient between $X_j$ and $Y_j$ and $l = \min (n,m)$. 

A formal proof of this statement can be established subject to an 
appropriate transformation of random variables. The reader who wishes 
to forgo such an exercise may satisfy himself by considering this equation 
as an extension of Eq. (8-77). 
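The agreement between Eqs. (9-8) and (9-8a) can be checked numerically in a simple case. The sketch below assumes two-dimensional vectors X and Y with unit variances in which $X_1$ is correlated only with $Y_1$ and $X_2$ only with $Y_2$, so that no preliminary transformation of variables is needed. The determinant routine, the function names, and the correlation values are illustrative assumptions; natural logarithms are used.

```python
import math

def det(matrix):
    """Determinant by gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result          # a row exchange flips the sign
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return result

def gaussian_mutual_information(a, b, c):
    """I(X;Y) = (1/2) log(det A det B / det C), in nats, as in Eq. (9-8)."""
    return 0.5 * math.log(det(a) * det(b) / det(c))

rho1, rho2 = 0.8, 0.5                    # assumed correlation coefficients
a = [[1.0, 0.0], [0.0, 1.0]]             # moment matrix of X
b = [[1.0, 0.0], [0.0, 1.0]]             # moment matrix of Y
c = [[1.0, 0.0, rho1, 0.0],              # joint moment matrix [[A, D], [D^t, B]]
     [0.0, 1.0, 0.0, rho2],
     [rho1, 0.0, 1.0, 0.0],
     [0.0, rho2, 0.0, 1.0]]

i_det = gaussian_mutual_information(a, b, c)
i_rho = -0.5 * math.log((1.0 - rho1 ** 2) * (1.0 - rho2 ** 2))
print(i_det, i_rho)
```

Both expressions should give the same value, since here $\det C = (1 - \rho_1^2)(1 - \rho_2^2)$.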





9-4. A Channel-capacity Theorem for Additive Gaussian Noise. In 
order to obtain a simple formulation for the maximum transmission of 
information, we make the following plausible assumptions. These 
assumptions are of a practical nature, and they lead to a simplified 
mathematical formulation. 

1. Let the input and the output be n-dimensional variates X and Y, 
respectively, with 

$$\bar{X}_k = 0 \qquad k = 1, 2, \ldots, n$$

2. Let the noise Z also be an n-dimensional normal variate with 

$$\bar{Z}_k = 0 \qquad \overline{Z_k^2} = \sigma_{z_k}^2 \qquad k = 1, 2, \ldots, n$$

Covariance of ($Z_k$ and $Z_j$) = 0 for $k \neq j$ (independent components). 

3. The noise is assumed to be additive, that is, 

$$Y = X + Z$$

With these assumptions, we compute the transinformation and obtain 
the associated channel capacity. First, one can compute the entropy 
$H(Y|X)$ similar to Eq. (8-84). In fact, 

$$f(y|x) = \varphi(y - x) = \varphi(z)$$ (9-9)

where $\varphi(z)$ is the density of noise. 

$$\varphi(z) = \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_{z_k}} \exp \left(-\frac{z_k^2}{2\sigma_{z_k}^2}\right)$$ (9-10)

The entropy associated with Z is the sum of the entropies of each of its 
components since the components are statistically independent; thus 

$$H(Y|X) = -\int \varphi(z) \log \varphi(z)\,dz = H(Z) = \sum_{k=1}^{n} \log \sqrt{2\pi e}\,\sigma_{z_k}$$ (9-11)

The transinformation is 

$$I(X;Y) = H(Y) - H(Y|X) = H(Y) - H(Z)$$

$$I(X;Y) = H(Y) - \sum_{k=1}^{n} \log \sqrt{2\pi e}\,\sigma_{z_k}$$ (9-12)


In order to find the channel capacity, we need to maximize $I(X;Y)$ 
under assumed constraints. This coincides with the maximization of 
the entropy $H(Y)$. Therefore, we examine more closely the random 
vector Y. Each component of X is affected by an independent gaussian 
perturbation, that is, 

$$Y_k = X_k + Z_k$$

Furthermore, we have specified that $X_k$ has a distribution density with 
zero mean and given standard deviation. Therefore, according to Eqs. 
(6-10) and (6-24), $Y_k$ has a distribution with zero mean and standard 
deviation $\sigma_{y_k}$ such that 

$$\sigma_{y_k}^2 = \sigma_{x_k}^2 + \sigma_{z_k}^2$$ (9-13)

In order to make $H(Y)$ a maximum we note that the entropy associated 
with a multidimensional scheme is greatest when: 

1. All the dimensions are independent random variables. 

2. Each dimension has the greatest entropy under the specified 
constraint (Sec. 8-7). 

Condition 1 implies independence of sampling points and 
condition 2 requires that each sample have a gaussian distribution (see Sec. 
8-7). Thus we are led to the case where the received signal Y has 
statistically independent components each normally distributed (with zero 
mean and the specified standard deviation). The maximum possible 
transinformation under these conditions is 

$$\max I(X;Y) = \sum_{k=1}^{n} \log \sqrt{2\pi e}\,\sigma_{y_k} - \sum_{k=1}^{n} \log \sqrt{2\pi e}\,\sigma_{z_k}$$

$$= \sum_{k=1}^{n} \log \frac{\sigma_{y_k}}{\sigma_{z_k}} = \frac{1}{2} \sum_{k=1}^{n} \log \left(1 + \frac{\sigma_{x_k}^2}{\sigma_{z_k}^2}\right)$$ (9-14)


This formula obviously gives the channel capacity under the specified 
constraints. The source X achieving such a rate of transinformation 
for the channel under discussion is also an independent n-dimensional 
gaussian source as its density can be directly obtained by the convolution 
of two such densities ($X_k = Y_k - Z_k$). 

A final simplification is required before reaching the compact 
formulation of Eq. (9-1). We may assume that the variances (power) for all the 
$X_k$ are identical, also that the variances of the noise samples are equal. 

$$\sigma_{x_k} = \sigma_x \qquad \sigma_{z_k} = \sigma_z \qquad k = 1, 2, \ldots, n$$

In this case the maximum of the transinformation is 

$$\max I(X;Y) = \frac{1}{2} \sum_{k=1}^{n} \log \left(1 + \frac{\sigma_x^2}{\sigma_z^2}\right)$$ (9-15)

In Sec. 6-3, it was pointed out that, when $\bar{X} = 0$, the variance 
represents the average power dissipated in a unit resistor under the 
application of a random voltage X. Thus $\sigma_x^2$ and $\sigma_z^2$ can be replaced by 
S and N, respectively, the signal and the noise power. 

$$\max I(X;Y) = \frac{n}{2} \log \left(1 + \frac{S}{N}\right)$$

The engineering significance of this equation lies in the fact that it gives 
an upper limit for transinformation in a communication channel under 
"reasonable" assumptions. Furthermore, the value of that upper bound 
is described in terms of signal and noise power which can both be 
measured in the laboratory. 
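The capacity expressions of this section are easily evaluated numerically. The following Python sketch computes the general sum of Eq. (9-14) and its equal-power special case. It assumes logarithms to the base 2, so that results are in bits per n-dimensional sample; the function names and the power values are illustrative and are not taken from the text.

```python
import math

def parallel_gaussian_capacity(signal_vars, noise_vars):
    """(1/2) sum log2(1 + sigma_x_k^2 / sigma_z_k^2), as in Eq. (9-14)."""
    return 0.5 * sum(math.log2(1.0 + s / z)
                     for s, z in zip(signal_vars, noise_vars))

def equal_power_capacity(n, signal_power, noise_power):
    """(n/2) log2(1 + S/N), the equal-power special case."""
    return 0.5 * n * math.log2(1.0 + signal_power / noise_power)

n = 4
noise = [1.0] * n                        # assumed noise variances

# Equal signal variances: the general sum reduces to (n/2) log2(1 + S/N).
c_sum = parallel_gaussian_capacity([15.0] * n, noise)
c_equal = equal_power_capacity(n, 15.0, 1.0)

# Same total signal power (60 units) split unevenly over the components.
c_uneven = parallel_gaussian_capacity([30.0, 20.0, 8.0, 2.0], noise)

print(c_sum, c_equal, c_uneven)
```

With S/N = 15 each component contributes $\frac{1}{2} \log_2 16 = 2$ bits, so the first two values are 8 bits. In this particular example the uneven split of the same total power gives a smaller value.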

9-5. Digression. While a continuation of the more basic approach to 
the mathematical study of multivariate channels with or without memory 
is desirable, at this point we wish to digress and assume another direction. 
The slight detour of this section is of much engineering interest in the 
study of communication systems. Because of this detour, we shall avoid 
traveling a mathematical path which, as yet, remains to be paved. 
Meanwhile we shall have an opportunity to see the machinery which has 
already been supplied for the future building of such a road. 

Fig. 9-1. A continuous random signal and its quantized form. 

In order to appreciate the engineering problem which has forced us to 
this detour, consider what comes to the mind of an engineer thinking of a 
continuous signal. A random continuous signal $f(t)$ is represented in 
Fig. 9-1; its value at any instant of time is unpredictable. That is, at 
each instant $t_k$, $f(t_k)$ is a random variable. The signal is a member of a 
class of signals that are referred to as stochastic. The study of such 
sources requires the statistical knowledge of generally nondenumerably 
infinite numbers of random variables such as $f(t_k)$. This is a tremendous 
task and considerably beyond our present scope. However, on our 
detour two important steps are emphasized that will help the reader to 
obtain some useful results, namely, a simplified version of this complex 
problem. These steps are as follows: 

1. Describe a class of continuous signals, the study of which can be 
"practically" reduced to the study of a discrete problem. 

2. Once the problem is reduced to a discrete case, it may be further 
reduced by plausible approximation to sources of the type of finite 
multidimensional random variables. Thereafter, the methods of Secs. 
9-2 and 9-3 may be employed. 

For step 1 we select the class of band-limited signals, since they are, in 
a sense, "equivalent" to a class of denumerably infinite random 
variables. This is facilitated by the use of the sampling theorem (Sec. 9-6). 

Step 2 requires some "engineering approximation," allowing a further 
simplification of the problem to finite-dimensional variates. This is 
done by exploiting the concept of signal space (Secs. 9-8 to 9-12). 

9-6. Sampling Theorem. In all communication equipment we deal 
with a limited frequency range. For example, we may apply some 
electric signals to a two-port filter with a transfer function $T(S)$. The plot of 
the magnitude of $T(S)$ is generally limited to a frequency range, say 
$(-\omega_0, \omega_0)$, as illustrated in Fig. 9-2. Of course, the band-limitation 
statement is meant to be plausible rather than rigorously correct, in the strict 
mathematical sense of the word, as $|T(j\omega)|$ for an RLC lumped system 
cannot be identically zero for $|\omega| > |\omega_0|$. But the transmission 
characteristics of all "physical systems" for all "practical purposes" vanish 
for "very large frequencies." Therefore, it is reasonable to confine ourselves 
to the class of all signals such that their Fourier integrals have no 
frequency content beyond some range $(-\omega_0, +\omega_0)$. More specifically, 
a signal $f(t)$ is said to be band-limited when 

$$F(j\omega) = \int_{-\infty}^{\infty} f(t) e^{-j\omega t}\,dt = 0 \qquad \text{for } |\omega| > |\omega_0| \neq 0$$ (9-16)

Fig. 9-2. A band-limited signal. 

The frequencies of the human voice are generally between a few cycles 
and 4,000 cycles per second. The frequency range of the human eye 
and ear is also limited. The bandwidths of telephone, telegraph, and 
television are other examples of band-limited communication equipment. 

While it takes a continuum of values to identify an arbitrary continuous 
signal in the real interval $[-\infty, +\infty]$, we shall show that the restriction of 
Eq. (9-16) will reduce the identification problem to that of a real function 
specified by a denumerable set of values. The sampling theorem below 
states that, if a signal is band-limited, it can be completely specified by its 
values at a sequence of discrete points. This theorem serves as a basis 
for the transition from a problem of continuum to a problem of discrete 
domain. 

Theorem. Let $f(t)$ be a function of a real variable, possessing a 
band-limited Fourier integral transform $F(j\omega)$ such that* 

$$F(j\omega) = 0 \qquad \text{for } |\omega| > |\omega_0|$$

Then $f(t)$ is completely determined by knowledge of its value at a sequence 
of points with abscissas equal to $\pi n/\omega_0$, 

$$n = [\ldots, -2, -1, 0, 1, 2, \ldots]$$

Furthermore, $f(t)$ can be expressed in the following form: 

$$f(t) = \sum_{n=-\infty}^{\infty} f\left(\frac{\pi n}{\omega_0}\right) \frac{\sin \omega_0(t - \pi n/\omega_0)}{\omega_0(t - \pi n/\omega_0)}$$ (9-17)

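Proof. Consider the pair of Fourier integrals: 

$$F(j\omega) = \int_{-\infty}^{\infty} f(t) e^{-j\omega t}\,dt$$ (9-18a)

$$f(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} F(j\omega) e^{j\omega t}\,d\omega$$ (9-18b)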

* For mathematical convenience we confine ourselves to cases where $F(j\omega)$ does not 
contain delta functions. Such restrictions could be removed by special mathematical 
considerations. $F(j\omega)$ may contain delta functions but not at points $\omega = \pm\omega_0$. The 
theorem is also valid when the interval is not necessarily centered at the origin. 

Obtaining the Fourier series expansion of the function $F(j\omega)$ in its 
fundamental period of $2\omega_0$ yields* 

$$F(j\omega) = \sum_{n=-\infty}^{\infty} c_n e^{jn\pi\omega/\omega_0} \qquad \text{for } |\omega| < |\omega_0|$$ (9-19)

where $c_n$ are Fourier coefficients, that is, 

$$c_n = \frac{1}{2\omega_0} \int_{-\omega_0}^{\omega_0} F(j\omega) e^{-jn\pi\omega/\omega_0}\,d\omega$$ (9-20)

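Equation (9-18b) suggests the following values for the Fourier series 
coefficients in Eq. (9-20): 

$$f\left(-\frac{\pi n}{\omega_0}\right) = \frac{1}{2\pi} \int_{-\omega_0}^{\omega_0} F(j\omega) e^{-jn\pi\omega/\omega_0}\,d\omega$$

Thus 

$$c_n = \frac{\pi}{\omega_0} f\left(-\frac{\pi n}{\omega_0}\right)$$ (9-21)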


These equations show that the Fourier coefficients are completely 
determined by a knowledge of the values of the original function $f(t)$, sampled 
at intervals of time $\pi/\omega_0$ apart. Thus $F(j\omega)$ is uniquely determined by a 
knowledge of the values of the sampled ordinates. This, in turn, 
guarantees the unique determination of $f(t)$ through Eq. (9-18b), as Fourier 
integral pairs uniquely determine each other.† 

In order to prove the identity of Eq. (9-17), we note that the right-hand 
member of the equation is a time function which assumes the value 
of $f(\pi n/\omega_0)$ at time $t = \pi n/\omega_0$. Indeed, all the terms of the summation of 
Eq. (9-17) vanish for $t = \pm(\pi k/\omega_0)$, $k = 1, 2, \ldots$, except for $k = n$, 
for which 

$$f\left(\frac{\pi n}{\omega_0}\right) \frac{\sin \omega_0(\pi n/\omega_0 - \pi n/\omega_0)}{\omega_0(\pi n/\omega_0 - \pi n/\omega_0)} = f\left(\frac{\pi n}{\omega_0}\right)$$ (9-22)


Thus the right-hand member of Eq. (9-17) coincides with $f(t)$ at the 
sampling points. But according to the first part of the theorem, proved 
earlier, the function $f(t)$ is completely determined through its values at 
these sampled points, whence the identity of the two sides of Eq. (9-17) 
is proved. The symbolic equation (9-23) serves as a reminder of the 
fact that, when the sampling theorem holds, the coefficients of the Fourier 
series expansion of $F(j\omega)$ lead to the sampling values of $f(t)$. 

$$\left[\ldots, f\left(\frac{2\pi}{\omega_0}\right), f\left(\frac{\pi}{\omega_0}\right), f(0), f\left(-\frac{\pi}{\omega_0}\right), f\left(-\frac{2\pi}{\omega_0}\right), \ldots\right] = \frac{\omega_0}{\pi} [\ldots, c_{-2}, c_{-1}, c_0, c_1, c_2, \ldots]$$ (9-23)

* The reader is assumed to be familiar with Fourier series and integrals. Fourier 
series expansion of a real-valued function is quite common. By the Fourier series 
expansion of $F(j\omega) = A(\omega) + jB(\omega)$, we mean the sum of the Fourier expansions of 
A and B in the same interval. Note that $c_n$ and $c_{-n}$ are not necessarily conjugate for 
all n unless $F(j\omega)$ is a real-valued function. 

† As a more direct proof, substitute for $c_n$ in Eq. (9-19) its values taken from 
(9-21); then substitute this result in Eq. (9-18b), in order to establish the identity of 
the two sides of Eq. (9-17). 

An equivalent way of expressing the sampling theorem is the following: 
Theorem. Let $F(j\omega)$ be a function of the real variable $\omega$ possessing a 
time-limited Fourier inverse integral transform $f(t)$, that is, 

$$|f(t)| = 0 \qquad \text{for } |t| > |t_0|$$

Then $F(j\omega)$ is completely determined by its ordinates at a sequence of 
points with abscissas equal to $n\pi/t_0$, where n assumes the following values: 

$$[\ldots, -2, -1, 0, 1, 2, \ldots]$$

Proof. The proof of this theorem is analogous to the proof of the 
previous one. The theorem may be derived from the previous one in 
the following way: 

1. In the hypothesis of the sampling theorem in the frequency domain, 
interchange t with $\omega$, $t_0$ with $\omega_0$, and f with F. 

2. Then the conclusion of the theorem will coincide with the statement 
of the present one. 

$$F(j\omega) = \sum_{n=-\infty}^{\infty} F\left(j\frac{\pi n}{t_0}\right) \frac{\sin t_0(\omega - \pi n/t_0)}{t_0(\omega - \pi n/t_0)}$$ (9-24)

Fig. 9-3. A time-limited signal and its Fourier transform. 


As an example of the application of the sampling theorem, consider a 
time-limited function $f(t)$ constrained by 

$$f(t) = 0 \qquad \text{for } |t| > |t_0|$$

According to the sampling theorem in the frequency domain, $F(j\omega)$, the 
Fourier transform of $f(t)$, will be completely determined by its values at 
the doubly infinite sequence 

$$\left[\ldots, -\frac{2\pi}{t_0}, -\frac{\pi}{t_0}, 0, \frac{\pi}{t_0}, \frac{2\pi}{t_0}, \ldots\right]$$

For example, if we assume that at these points the function $F(j\omega)$ takes 
on the sequence of values 

$$[\ldots, 0, 0, 2Kt_0, 0, 0, \ldots]$$

then, because of the time-limited character of $f(t)$, $F(j\omega)$ will be given by 

$$F(j\omega) = 2Kt_0 \frac{\sin \omega t_0}{\omega t_0}$$ (9-25)

A check of the admissibility of this answer is provided by a consultation 
of Fourier integral tables (or direct derivation). In fact, from Fourier 
integral tables one obtains 

$$f(t) = K \qquad \text{for } |t| < |t_0|$$

$$f(t) = 0 \qquad \text{elsewhere}$$

Thus $f(t)$ is indeed a time-limited function, as originally assumed. 

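Similarly, we may consider an ideal low-pass filter as an example for 
checking the validity of the sampling theorem in the time domain. Let 

$$F(j\omega) = 0 \qquad \text{for } |\omega| > |\omega_0|$$

$$\text{Re } F(j\omega) = k \qquad \text{for } |\omega| < |\omega_0| \qquad k > 0$$

$$\text{Im } F(j\omega) = 0 \qquad \text{for } |\omega| < |\omega_0|$$

The corresponding time function can be obtained from the Fourier 
integral tables (or, in such a simple example, directly): 

$$f(t) = \omega_0 k \frac{\sin \omega_0 t}{\pi \omega_0 t}$$ (9-26)

Note that, because of the band-limited character of $F(j\omega)$, $f(t)$ should be 
completely determined by the following domain and range. 

$$\left[\ldots, -\frac{2\pi}{\omega_0}, -\frac{\pi}{\omega_0}, 0, \frac{\pi}{\omega_0}, \frac{2\pi}{\omega_0}, \ldots\right]$$

$$\left[\ldots, 0, 0, \frac{k\omega_0}{\pi}, 0, 0, \ldots\right]$$

This result is indeed in agreement with the expression for $f(t)$ suggested by 
the sampling theorem. 

$$f(t) = \sum_{n=-\infty}^{\infty} f\left(\frac{\pi n}{\omega_0}\right) \frac{\sin \omega_0(t - \pi n/\omega_0)}{\omega_0(t - \pi n/\omega_0)} = \omega_0 k \frac{\sin \omega_0 t}{\pi \omega_0 t}$$ (9-27)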




9-7. A Physical Interpretation of the Sampling Theorem. A physical 
interpretation of the sampling theorem can readily be proposed. 
Suppose that a continuous band-limited voltage signal $v(t)$ is given. We 
could quantize $v(t)$ at times 

$$\ldots, -\frac{2\pi}{\omega_0}, -\frac{\pi}{\omega_0}, 0, \frac{\pi}{\omega_0}, \frac{2\pi}{\omega_0}, \ldots$$

The quantized voltages are successively applied to an ideal low-pass 
filter as impulses of appropriate magnitude at the specified times. The 
response of an ideal low-pass filter with cutoff at $\omega_0$ to a unit impulse 
$u_0(t)$ is found to be $k(\sin \omega_0 t)/\pi t$, where k is the constant of the filter. Thus 
the total output of the filter, $v_0(t)$, will represent the original time 
function $v(t)$ within some scale factor. 

$$v_0(t) = \frac{k\omega_0}{\pi} \sum_{n=-\infty}^{\infty} v\left(\frac{\pi n}{\omega_0}\right) \frac{\sin \omega_0(t - \pi n/\omega_0)}{\omega_0(t - \pi n/\omega_0)}$$ (9-28)

Fig. 9-4. Physical interpretation of the sampling theorem. 

If the constant of the filter is $\pi/\omega_0$, then $v_0(t)$ and $v(t)$ become identical. 
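The filter interpretation of Eq. (9-28) can be imitated numerically by truncating the sum to a finite number of samples. The sketch below is a rough illustration under stated assumptions: the test signal is the squared $(\sin x)/x$ pulse, whose spectrum is triangular and confined to $(-\omega_0, +\omega_0)$; the sum is cut off at $|n| \leq 200$; and the names and numerical values are arbitrary.

```python
import math

def sinc(x):
    """sin(x)/x with the limiting value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x

OMEGA0 = 2.0 * math.pi                   # assumed band limit, radians per second

def v(t):
    """Band-limited test signal with a triangular spectrum on (-OMEGA0, OMEGA0)."""
    return sinc(OMEGA0 * t / 2.0) ** 2

def filter_output(t, signal, omega0, k, n_terms):
    """Truncated form of Eq. (9-28): samples weighted by the filter response."""
    total = 0.0
    for n in range(-n_terms, n_terms + 1):
        t_n = math.pi * n / omega0       # sampling instant
        total += signal(t_n) * sinc(omega0 * (t - t_n))
    return (k * omega0 / math.pi) * total

test_times = [0.13, 0.9, 2.37]           # instants between sampling points
matched = [filter_output(t, v, OMEGA0, math.pi / OMEGA0, 200) for t in test_times]
scaled = [filter_output(t, v, OMEGA0, 1.0, 200) for t in test_times]

for t, out in zip(test_times, matched):
    print(t, v(t), out)
```

With the filter constant set to $\pi/\omega_0$ the output should reproduce $v(t)$ to within the small truncation error; with any other constant it is the same waveform multiplied by $k\omega_0/\pi$.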

The concept of the sampling theorem has frequently been employed in 
communication problems such as the extensive work done at the Bell 
Telephone Laboratories on speech transmission and also by other 
investigators prior to Shannon, such as Nyquist, Küpfmüller, and Gabor. The 
mathematical statement of the sampling theorem has been made by E. T. 
Whittaker, J. M. Whittaker, and several other mathematicians. Subse- 
quent to Shannon’s use of the sampling theorem, a number of interesting 
articles on this subject have appeared in the literature of engineering as 
well as in mathematical journals (see N-2 of the Appendix). 






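Example 9-1. We illustrate the sampling theorem with the following example. 
Consider a band-limited signal with a triangular frequency distribution, as shown in 
Fig. E9-1. 

Fig. E9-1 

From a table of Fourier transforms (or by performing the integration) we find that 
the time-domain function describing such a signal is given by 

$$f(t) = \frac{K\omega_0}{2\pi} \left[\frac{\sin (\omega_0 t/2)}{\omega_0 t/2}\right]^2$$ (a)

Now, the sampling theorem states that 

$$f(t) = \sum_{n=-\infty}^{\infty} f\left(\frac{\pi n}{\omega_0}\right) \frac{\sin \omega_0(t - \pi n/\omega_0)}{\omega_0(t - \pi n/\omega_0)}$$ (b)

where $f(\pi n/\omega_0)$ is the value of $f(t)$ at the respective sampling points. 

We illustrate the theorem by showing that the summation of Eq. (b) does, in fact, 
yield the function in (a) when the appropriate values are inserted in Eq. (b). 
First, evaluate $f(\pi n/\omega_0)$ at the sampling points by setting t in Eq. (a) equal to $\pi n/\omega_0$. 

$$f(0) = \frac{K\omega_0}{2\pi}$$ (c)

$$f\left(\frac{\pi n}{\omega_0}\right) = \frac{K\omega_0}{2\pi} \left(\frac{\sin \pi m}{\pi m}\right)^2 = 0$$ (d)

where $m = n/2$, $n = \pm 2, \pm 4, \pm 6, \ldots$ 

$$f\left(\frac{\pi n}{\omega_0}\right) = \frac{K\omega_0}{2\pi} \left[\frac{\sin (2n/4)\pi}{(2n/4)\pi}\right]^2 = \frac{K\omega_0}{2\pi} \left(\frac{4}{2\pi n}\right)^2 = \frac{K\omega_0}{2\pi} \frac{4}{\pi^2 n^2}$$ (e)

where n is odd. 

Next, evaluate $\sin \omega_0(t - n\pi/\omega_0) / [\omega_0(t - n\pi/\omega_0)]$ at each sampling point by 
letting n take on integral values from $-\infty$ to $\infty$. 

By a trigonometric identity, $\sin \omega_0(t - n\pi/\omega_0) = \sin \omega_0 t \cos n\pi - \cos \omega_0 t \sin n\pi$. 
Therefore 

$$\frac{\sin \omega_0(t - n\pi/\omega_0)}{\omega_0(t - n\pi/\omega_0)} = \sin \omega_0 t \, \frac{(-1)^n}{\omega_0 t - n\pi}$$

Equation (b) can now be written 

$$f(t) = \sin \omega_0 t \sum_{n=-\infty}^{\infty} f\left(\frac{n\pi}{\omega_0}\right) \frac{(-1)^n}{\omega_0 t - n\pi}$$ (f)

Since $f(n\pi/\omega_0) = f(-n\pi/\omega_0)$ [from Eq. (e)], we can add pairs of terms having the 
same |n| and take the summation from n = 1 to n = $\infty$. The pairwise sums can be 
written 

$$(-1)^n f\left(\frac{n\pi}{\omega_0}\right) \left[\frac{1}{\omega_0 t - \pi n} + \frac{1}{\omega_0 t + \pi n}\right] = (-1)^n f\left(\frac{n\pi}{\omega_0}\right) \frac{2\omega_0 t}{(\omega_0 t)^2 - \pi^2 n^2}$$ (g)

Then, substituting Eq. (g) into Eq. (f), 

$$f(t) = \sin \omega_0 t \left[\frac{f(0)}{\omega_0 t} + 2\omega_0 t \sum_{n=1}^{\infty} \frac{(-1)^n f(n\pi/\omega_0)}{(\omega_0 t)^2 - \pi^2 n^2}\right]$$ (h)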



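But from Eqs. (c) to (e) we know the values of $f(n\pi/\omega_0)$ to substitute into (h).
Furthermore, we know that $f(n\pi/\omega_0) = 0$ ($n$ even) from Eq. (d), and so we can
substitute a new index, $m$, into the summation, where

$$ n = 2m - 1 \tag{i} $$

thus summing only over the odd values of $n$.

Performing these substitutions in (h) we obtain

$$ f(t) = \frac{K\omega_0}{2\pi}\sin\omega_0 t\left[\frac{1}{\omega_0 t} + \sum_{m=1}^{\infty}\frac{4}{\pi^2(2m-1)^2}\,\frac{2\omega_0 t}{\pi^2(2m-1)^2 - \omega_0^2 t^2}\right] \tag{j} $$

We substitute the following trigonometric identity for $\sin\omega_0 t$:

$$ \sin\omega_0 t = 2\sin\frac{\omega_0 t}{2}\cos\frac{\omega_0 t}{2} \tag{k} $$

$$ f(t) = \frac{K\omega_0}{2\pi}\,2\sin\frac{\omega_0 t}{2}\cos\frac{\omega_0 t}{2}\left[\frac{1}{\omega_0 t} + \sum_{m=1}^{\infty}\frac{8\,\omega_0 t}{\pi^2(2m-1)^2\,[\pi^2(2m-1)^2 - \omega_0^2 t^2]}\right] \tag{l} $$

Performing some algebraic manipulations, we get

$$ f(t) = \frac{K\omega_0}{2\pi}\,\frac{\sin(\omega_0 t/2)\cos(\omega_0 t/2)}{\omega_0 t/2}\left[1 + \frac{2(\omega_0 t)^2}{\pi^2}\sum_{m=1}^{\infty}\frac{1}{(2m-1)^2\,[(m-\tfrac12)^2\pi^2 - (\omega_0 t/2)^2]}\right] \tag{m} $$

Then by bringing $(\omega_0 t)^2/\pi^2$ inside the summation and dividing both numerator and
denominator by 4,

$$ f(t) = \frac{K\omega_0}{2\pi}\,\frac{\sin(\omega_0 t/2)\cos(\omega_0 t/2)}{\omega_0 t/2}\left[1 + 2\sum_{m=1}^{\infty}\frac{(\omega_0 t/2)^2}{\pi^2(m-\tfrac12)^2\,[\pi^2(m-\tfrac12)^2 - (\omega_0 t/2)^2]}\right] \tag{n} $$

Rewriting the second term in the summation by partial fractions,

$$ f(t) = \frac{K\omega_0}{2\pi}\,\frac{\sin(\omega_0 t/2)\cos(\omega_0 t/2)}{\omega_0 t/2}\left[1 + 2\sum_{m=1}^{\infty}\left\{\frac{1}{\pi^2(m-\tfrac12)^2 - (\omega_0 t/2)^2} - \frac{1}{\pi^2(m-\tfrac12)^2}\right\}\right] \tag{o} $$

Separating the parts of the summation,

$$ f(t) = \frac{K\omega_0}{2\pi}\,\frac{\sin(\omega_0 t/2)\cos(\omega_0 t/2)}{\omega_0 t/2}\left[1 + 2\sum_{m=1}^{\infty}\frac{1}{\pi^2(m-\tfrac12)^2 - (\omega_0 t/2)^2} - 2\sum_{m=1}^{\infty}\frac{1}{\pi^2(m-\tfrac12)^2}\right] \tag{p} $$

We now use the following identity,

$$ \tan z = 2z\sum_{n=0}^{\infty}\frac{1}{(n+\tfrac12)^2\pi^2 - z^2} \tag{q} $$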







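(see, for instance, E. C. Titchmarsh, "Theory of Functions," p. 113, Oxford University
Press, New York, 1950), which can be rewritten with the substitution $m = n + 1$.

$$ \tan z = 2z\sum_{m=1}^{\infty}\frac{1}{\pi^2(m-\tfrac12)^2 - z^2} \tag{r} $$

Widder* shows

$$ \sum_{m=1}^{\infty}\frac{1}{(2m-1)^2} = 1 + \frac{1}{3^2} + \frac{1}{5^2} + \cdots = \frac{\pi^2}{8} \tag{s} $$

Substituting $z = \omega_0 t/2$ in Eq. (r) and substituting Eqs. (r) and (s) in Eq. (p), we get

$$ f(t) = \frac{K\omega_0}{2\pi}\,\frac{\sin(\omega_0 t/2)\cos(\omega_0 t/2)}{\omega_0 t/2}\left[1 + \frac{\tan(\omega_0 t/2)}{\omega_0 t/2} - \frac{8}{\pi^2}\,\frac{\pi^2}{8}\right] \tag{t} $$

and finally

$$ f(t) = \frac{K\omega_0}{2\pi}\left[\frac{\sin(\omega_0 t/2)}{\omega_0 t/2}\right]^2 $$

in agreement with Eq. (a).

This solution was derived by several students, in particular, S. Rubin, during a
course on information theory.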
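
The result of Example 9-1 can also be checked numerically. The following Python sketch is not part of the original text; it sums a truncated version of the sampling series of Eq. (b) and compares it with the closed form of Eq. (a). The values K = 1 and ω0 = 2π, the truncation limit `n_max`, and all function names are assumptions chosen only for illustration.

```python
import math

def sinc(x):
    """sin(x)/x with the removable singularity at x = 0 filled in."""
    return 1.0 if x == 0.0 else math.sin(x) / x

def f_exact(t, K=1.0, w0=2.0 * math.pi):
    """Closed form of Eq. (a): (K*w0/2pi) * [sin(w0*t/2) / (w0*t/2)]**2."""
    return K * w0 / (2.0 * math.pi) * sinc(w0 * t / 2.0) ** 2

def f_series(t, K=1.0, w0=2.0 * math.pi, n_max=2000):
    """Sampling series of Eq. (b), truncated to |n| <= n_max (an assumption)."""
    total = 0.0
    for n in range(-n_max, n_max + 1):
        sample = f_exact(n * math.pi / w0, K, w0)      # f(n*pi/w0)
        total += sample * sinc(w0 * t - n * math.pi)   # interpolation term
    return total

if __name__ == "__main__":
    for t in (0.13, 0.5, 1.7):
        print(t, f_exact(t), f_series(t))
```

Because the samples fall off as 1/n² and the interpolating terms as 1/n, the truncated series converges quickly; the two columns printed above should agree to many decimal places.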

9-8. The Concept of a Vector Space. The use of n-dimensional real or
complex space is quite common in engineering problems. For instance,
in the study of signals in communication theories one frequently employs
the concept of vector space, although this may not appear in its strict
mathematical frame of reference. The concept of power content of signals
in engineering texts is an example of tacitly exploiting a basic product of
vector-space theory. It is safe to assume that not all the readers are
familiar with these concepts. For the benefit of such readers, we include
this section as a digression from the main stream of thought. The section
is designed to provide a brief glimpse into vector-space theory. Meanwhile
the use of basic axioms required by a vector space and the properties
specified for a norm are presented in a way parallel to the treatment of
some of the material presented in Chap. 2.

Spaces. A set or collection of elements, generally called points, is
said to form a space. These points need not be geometrical points in the
usual sense; the space may be a collection of points, or vectors, or
functions, etc. For example, the set of all functions

$$ y = x\cos\alpha $$

where $x$ is a given real number and $\alpha$ a real variable, forms a space $S$.
For each value of the parameter $x$ we have a point in this space. The
points $y_1 = 2\cos\alpha$, $y_2 = -3\cos\alpha$ are elements of the set under
consideration. This is usually written in the form

$$ y_1 \in S \qquad y_2 \in S $$

* "Advanced Calculus," p. 340, Example B, Prentice-Hall, Inc., Englewood
Cliffs, N.J.


Vector Space. A vector space is defined in the following way: Let $F$
be the set of real or complex numbers. $V$ is said to be a vector space if,
for every pair of points $x$ and $y$, $x \in V$, $y \in V$, and a number $\alpha \in F$,
the operations of addition and multiplication by a number are defined so
that

$$ x + y \in V \qquad \alpha x \in V \tag{9-29} $$

Furthermore, the following properties of the space are required:

1. There exists an element "zero" in $V$ such that, for each $x \in V$, $x + 0 = x$.

2. For every $x \in V$ there corresponds a unique point $-x \in V$ such
that $-x + x = 0$.

3. Addition is associative, for example, $(x + y) + z = x + (y + z) = x + y + z$.

4. $x + y = y + x$ (commutative law) for all $x, y \in V$.

5. $\alpha(x + y) = \alpha x + \alpha y$.

6. $(\alpha + \beta)x = \alpha x + \beta x$.

7. $(\alpha\beta)x = \alpha(\beta x)$.

8. $1 \cdot x = x$.

9. $0 \cdot x = 0$.

It is worthwhile to note here that the above properties are those of
ordinary vectors as encountered in undergraduate courses in applied
sciences, and for this reason any set of elements that has these properties
is called a vector space. In mathematical terminology, $V$ is said to form
an abstract space over $F$.

Example. The most familiar example of vector space is given by an
ordinary two-dimensional rectangular coordinate system, that is, the set of all
ordered pairs $(a_1,a_2)$ of real numbers, addition being defined as follows:
Given

$$ x = (a_1,a_2) \qquad y = (b_1,b_2) $$

then $x + y = (a_1 + b_1,\ a_2 + b_2)$ and $\alpha x = (\alpha a_1, \alpha a_2)$.

$$ \text{Vector } 0 = (0,0) \qquad x = (a_1,a_2) \qquad -x = (-a_1,-a_2) $$

All the properties listed above being satisfied, the set forms a vector
space. One has to be rather careful here in interpreting the pair $(a_1,a_2)$.
It is familiar from analytical geometry that such a pair represents a
"point" in the plane, and so addition of pairs as defined above will be
meaningless in geometrical language. But there is no harm in letting a
pair $(a_1,a_2)$, say, represent a vector issuing from the origin in the plane to
the point $(a_1,a_2)$; then any misunderstanding can be obviated.
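
The axioms listed for a vector space can be spot-checked mechanically for ordered pairs of numbers. The Python sketch below is an illustration added here, not part of the original text; the helper names `add`, `scale`, and `check_axioms` are hypothetical, and integer test values are assumed so that the comparisons are exact. Passing the checks on a few samples does not, of course, prove the axioms in general.

```python
from itertools import product

ZERO = (0, 0)

def add(x, y):
    """Componentwise addition of two ordered pairs."""
    return (x[0] + y[0], x[1] + y[1])

def scale(alpha, x):
    """Multiplication of an ordered pair by a number alpha."""
    return (alpha * x[0], alpha * x[1])

def check_axioms(vectors, scalars):
    """Spot-check properties 1 to 9 on the given sample vectors and scalars."""
    for x, y, z in product(vectors, repeat=3):
        assert add(add(x, y), z) == add(x, add(y, z))            # property 3
    for x, y in product(vectors, repeat=2):
        assert add(x, y) == add(y, x)                            # property 4
        for a in scalars:
            assert scale(a, add(x, y)) == add(scale(a, x), scale(a, y))  # 5
    for x in vectors:
        assert add(x, ZERO) == x                                 # property 1
        assert add(scale(-1, x), x) == ZERO                      # property 2
        assert scale(1, x) == x                                  # property 8
        assert scale(0, x) == ZERO                               # property 9
        for a, b in product(scalars, repeat=2):
            assert scale(a + b, x) == add(scale(a, x), scale(b, x))      # 6
            assert scale(a * b, x) == scale(a, scale(b, x))              # 7
    return True

if __name__ == "__main__":
    print(check_axioms([(1, 2), (-3, 4), (0, 5)], [2, -1, 3]))
```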

Linear and Linear Normed Spaces. A space $S$ is said to be linear
if, for every pair of numbers ($\alpha \in F$, $\beta \in F$) and any pair $x$, $y$ of
elements of $S$, $\alpha x + \beta y \in S$. In the application of the theory of linear
vector spaces to physical problems, a most significant type of vector space
is encountered. This is called normed linear space. Normed linear spaces
exhibit a natural generalization of the familiar euclidean space. We first
define what is meant by norm.

The norm of an element $x$ of the space, denoted by $\|x\|$, is a nonnegative real
number satisfying the following properties.

(I) $\|x\| = 0$ if $x = 0$

(II) $\|x\| \ne 0$ if $x \ne 0$

(III) $\|x + y\| \le \|x\| + \|y\|$

(IV) $\|\alpha x\| = |\alpha|\,\|x\|$

A vector space with a norm is called normed linear space.

Inner Product. As is known from ordinary vector analysis, the "inner
product" of two vectors $x = (a_1,a_2)$ and $y = (b_1,b_2)$ in the real two-dimensional
rectangular system is defined as $\langle x,y\rangle = a_1 b_1 + a_2 b_2$.

If we adopt the $\sqrt{\langle x,x\rangle}$ of a vector as its norm, we have

$$ \langle x,x\rangle = a_1^2 + a_2^2 = \|x\|^2 $$

If the two-dimensional normed linear space is a complex space, then a
natural extension of the above definition for the inner product leads to
the following. Let

$$ x = (\xi_1,\eta_1) \qquad y = (\xi_2,\eta_2) $$

where $\xi_1$, $\eta_1$, $\xi_2$, $\eta_2$ are complex numbers. Then the inner product of
$x$ and $y$ is

$$ \langle x,y\rangle = \xi_1\bar{\xi}_2 + \eta_1\bar{\eta}_2 $$

where the bar stands for a complex conjugate. The norm of $x$ is defined as

$$ \langle x,x\rangle = \|x\|^2 = \xi_1\bar{\xi}_1 + \eta_1\bar{\eta}_1 = |\xi_1|^2 + |\eta_1|^2 $$

Note that when the space is real this more general definition of the inner
product is obviously valid. A vector space for which the inner product is
defined is sometimes called an inner-product space. The following
properties for the inner product are given:

(I) $\langle x,y\rangle = \overline{\langle y,x\rangle}$

(II) $\langle \alpha_1 x_1 + \alpha_2 x_2,\ y\rangle = \alpha_1\langle x_1,y\rangle + \alpha_2\langle x_2,y\rangle$

(III) $\langle x,x\rangle \ge 0$; $\langle x,x\rangle = 0$ if, and only if, $x = 0$

(9-31)
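
As a small illustration of properties (I) to (III), the following Python sketch (added here, not from the original text) implements the complex inner product with the conjugate on the second argument, as in the definition above. The function names `inner` and `norm` and the sample vectors are assumptions made for the example.

```python
def inner(x, y):
    """<x, y> = sum of x_k * conj(y_k); linear in the first argument."""
    return sum(a * complex(b).conjugate() for a, b in zip(x, y))

def norm(x):
    """Norm induced by the inner product: sqrt(<x, x>)."""
    return inner(x, x).real ** 0.5

if __name__ == "__main__":
    x = (1 + 2j, 3 - 1j)
    y = (2 - 1j, 1j)
    print(inner(x, y), inner(y, x))   # complex conjugates of each other
    print(inner(x, x))                # real and nonnegative
    s = tuple(a + b for a, b in zip(x, y))
    print(norm(s) <= norm(x) + norm(y))   # triangle inequality
```

The triangle inequality printed in the last line is property (III) of the norm; it follows from the inner-product properties but is checked here only on one sample pair.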





N-dimensional Real or Complex Inner-product Spaces. We have
already defined the linear spaces with an inner product. Here we should
like to reiterate and establish the notation that is commonly used for
real and complex spaces.

An n-dimensional real or complex space will be denoted by $R^n$ and $C^n$,
respectively.

An example of $R^1$ is given by the space of points on a straight line. A
point $x$ has a norm equal to the absolute value of its abscissa. The inner
product of two points $x$ and $y$ is $xy$. As an example of $C^1$ one may
consider the set of all complex numbers.

The space $R^n$ is the n-dimensional ordinary euclidean space. The
sum of two elements $x \in R^n$ and $y \in R^n$,

$$ x = (a_1,a_2,\ldots,a_n) \qquad y = (b_1,b_2,\ldots,b_n) $$

is defined as

$$ x + y = (a_1 + b_1,\ a_2 + b_2,\ \ldots,\ a_n + b_n) $$

Then one sees that the square of a natural norm for an element here is

$$ \|x\|^2 = \sum_{k=1}^{n} a_k^2 $$

The inner product of $x$ and $y$ is

$$ \langle x,y\rangle = \sum_{i=1}^{n} a_i b_i \tag{9-32} $$

For a complex normed space $C^n$, each element $x \in C^n$ is defined by its
complex coordinates.

$$ x = (x_1,x_2,\ldots,x_n) \qquad \text{where each } x_k \text{ is complex} $$

Expressions for norm and inner products can also be readily obtained.

Hilbert Space. A direct generalization of $R^n$ will require defining $R^\infty$,
where $R^\infty$ denotes the set of all vectors of the form $(a_1,a_2,a_3,\ldots)$ with
a denumerably infinite number of components. Here, addition and scalar
multiplication will be defined as usual and the norm as

$$ \|x\| = \sqrt{\sum_{i=1}^{\infty} a_i^2} $$

Evidently, the norm will have a meaning if $\sum a_i^2$ is convergent, i.e.,

$$ \sum_{i=1}^{\infty} a_i^2 < \infty $$

Hence only those vectors with an infinite number of components and
possessing finite norms are allowed. The inner product of elements $x$ and
$y$ is defined as usual:

$$ \langle x,y\rangle = \sum_{i=1}^{\infty} a_i b_i $$

This definition has a meaning because

$$ |a_i b_i| \le \tfrac12\,(a_i^2 + b_i^2) $$

Thus

$$ 2\sum_{i=1}^{\infty} |a_i b_i| \le \sum_{i=1}^{\infty} a_i^2 + \sum_{i=1}^{\infty} b_i^2 < \infty $$

The totality of all such vectors which have a denumerably infinite
number of real components and whose norm has meaning is called a real
Hilbert space.*

It is to be noted that the Hilbert space need not be a real space. A
space $C^\infty$ with an inner product also forms a Hilbert space. In this case
we have a complex Hilbert space. The inner product of two elements
$x$ and $y$ is defined by

$$ \langle x,y\rangle = \sum_{k=1}^{\infty} x_k\bar{y}_k $$

At the close of this short and incomplete digression into the field of
vector space, it seems appropriate to describe in passing one or two more
related terms. A set of elements

$$ [A] = [A_1,A_2,\ldots,A_n] $$

of a vector space $S$ is said to form a linearly independent set if there exist
no numbers $k_i \in F$, not all zero,

$$ [k] = [k_1,k_2,\ldots,k_n] $$

such that

$$ [A][k]^T = k_1A_1 + \cdots + k_nA_n = 0 $$

A basis in a vector space is a set of linearly independent elements

$$ B = \{x_1,x_2,\ldots,x_m\} $$

such that every element $x$ of the space can be generated in a unique manner
by a linear combination of the elements of $B$. ($B$ spans $S$.)

$$ x = k_1x_1 + k_2x_2 + \cdots + k_mx_m \qquad \text{all } k\text{'s taken from } F $$

If $m$ is finite, the vector space is a finite-dimensional space. It can be
shown that every linear space has a basis.

Two elements $x$ and $y$ of $S$ are said to be orthogonal if their inner product
is zero:

$$ \langle x,y\rangle = 0 \tag{9-34} $$

* For a complete definition of a Hilbert space, see textbooks on linear spaces.





Furthermore, if

$$ \|x\| = \|y\| = 1 \tag{9-35} $$

then the two elements are said to be orthonormal. For example, in the
ordinary space $R^2$ the elements $x = (1,2)$ and $y = (3,-\tfrac32)$ are orthogonal,
while $x = (\tfrac12, \sqrt3/2)$, $y = (-\sqrt3/2, \tfrac12)$ are orthonormal elements.
The latter points form a basis for $R^2$.

As a direct extension of the above, subject to some mathematical care
(according to the concept of convergence), one can consider a class of
well-defined functions (say real signals with finite power in a finite time
interval) as points in a Hilbert space. Equations (9-34) and (9-35)
suggest the following definitions for orthogonality and orthonormality
of two elements $f(x)$ and $g(x)$:

$$ \int [f(x)]^2\,dx = \|f\|^2 \qquad \text{norm} $$

$$ \int f(x)\,g(x)\,dx = \langle f,g\rangle = 0 \qquad \text{orthogonality} $$

$$ \int [f(x)]^2\,dx = \int [g(x)]^2\,dx = 1 \qquad \text{orthonormality} $$
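
The two numerical examples just quoted are easily verified. The Python sketch below is an added illustration, not part of the original text; the helper names `dot`, `norm`, `expand`, and `combine` are hypothetical. It checks the orthogonal pair, the orthonormal pair, and the fact that a point of the plane is recovered from its coordinates with respect to the orthonormal basis.

```python
import math

def dot(x, y):
    """Real inner product of two points of R^n."""
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def expand(p, basis):
    """Coordinates of p with respect to an orthonormal basis: k_i = <p, b_i>."""
    return tuple(dot(p, b) for b in basis)

def combine(coeffs, basis):
    """Linear combination k_1 b_1 + ... + k_m b_m."""
    n = len(basis[0])
    return tuple(sum(k * b[i] for k, b in zip(coeffs, basis)) for i in range(n))

if __name__ == "__main__":
    x, y = (1.0, 2.0), (3.0, -1.5)                    # orthogonal, not normalized
    u = (0.5, math.sqrt(3) / 2)                       # orthonormal pair
    v = (-math.sqrt(3) / 2, 0.5)
    print(dot(x, y), dot(u, v), norm(u), norm(v))
    p = (2.0, -1.0)
    print(combine(expand(p, (u, v)), (u, v)))         # reproduces p up to rounding
```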

9-9. Fourier-series Signal Space. As an immediate engineering
application of the vector-space concept, we shall show that the function
space associated with the class of ordinary communication signals
possessing a Fourier series expansion is a Hilbert space. On the basis of this
consideration, the reader may appreciate the use of the powerful
mathematical tools of vector spaces for subsequent research in the field of
electrical communications.

Obviously, at present, we shall not be concerned with the fact that the
reader may not be immediately rewarded by the use of this modern tool.
To be able to apply the theory requires the pedagogic development of
many examples of applications. This is beyond our present objective.
However, the two cases of Fourier series and the sampling theorem will be
discussed briefly.

Fourier Series and the Hilbert Space. Consider a function $f(x)$ defined
in the interval $[-\pi,+\pi]$ and expressible in that interval in the form of a
convergent Fourier series.

$$ f(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty} a_k\cos kx + \sum_{k=1}^{\infty} b_k\sin kx = \sum_{k=-\infty}^{+\infty} c_k e^{jkx} \tag{9-36} $$

where

$$ a_0 = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\,dx $$

$$ a_k = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\cos kx\,dx \qquad k = 1, 2, \ldots \tag{9-37} $$

$$ b_k = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\sin kx\,dx $$

and

$$ c_k = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\,e^{-jkx}\,dx \tag{9-38} $$

[For positive $k$, $c_k = (a_k - jb_k)/2$, and for negative $k$, $c_k = (a_{|k|} + jb_{|k|})/2$.]
Without undue concern about the conditions for the existence and
convergence of such series, we merely assume that the function $f(x)$ has a
Fourier series expansion and belongs to the class of square summable
functions, that is,

$$ \int_{-\pi}^{\pi} [f(x)]^2\,dx < \infty \tag{9-39} $$

For notational convenience, let us denote this class of functions by $L^2$, the
superscript 2 being a reminder of the fact that the integral of the second
power of $f(x)$ in the interval of definition is a finite number. Now a
function $f \in L^2$ can be represented by a point in a Hilbert space. In
fact, the doubly infinite sequence of numbers $c_k$ in Eq. (9-36) uniquely
defines the function $f(x)$.

$$ [\ldots,c_{-2},c_{-1},c_0,c_1,c_2,\ldots] $$

[. . . ,c_2,c_i,Co,Ci,r2, . . .] 

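The addition of elements and the zero element are defined in a
straightforward manner. Thus the reader can check for himself that all
requirements of a vector space are satisfied. Furthermore, the inner product
and the norm also can be defined. That is, let $f \in L^2$ be a point in the
vector space with the coordinate $[c_k]$. The norm of $f$ is

$$ \|f\| = \left(\sum_{k=-\infty}^{\infty} c_k\bar{c}_k\right)^{1/2} = \left(\sum_{k=-\infty}^{\infty} |c_k|^2\right)^{1/2} \tag{9-40} $$

By a direct expansion of $f(x)$ in a Fourier series, it can be seen that

$$ \|f\| = \left[\frac{1}{2\pi}\int_{-\pi}^{\pi} [f(x)]^2\,dx\right]^{1/2} \tag{9-41} $$

For instance, the coordinates of the point $A_1$ representing the function
$(\sin x)/\sqrt{\pi}$ are

$$ \left[\ldots,\ 0,\ \frac{j}{2\sqrt{\pi}},\ 0,\ \frac{-j}{2\sqrt{\pi}},\ 0,\ \ldots\right] $$

where the three central entries are $c_{-1}$, $c_0$, $c_1$.
The coordinates of the point $B_1$ representing the function $(\cos x)/\sqrt{\pi}$ are

$$ \left[\ldots,\ 0,\ \frac{1}{2\sqrt{\pi}},\ 0,\ \frac{1}{2\sqrt{\pi}},\ 0,\ \ldots\right] $$

Note that

$$ \|A_1\| = \|B_1\| = \frac{1}{\sqrt{2\pi}} \qquad \langle A_1,B_1\rangle = \frac{j}{2\sqrt{\pi}\,2\sqrt{\pi}} - \frac{j}{2\sqrt{\pi}\,2\sqrt{\pi}} = 0 \tag{9-42} $$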


Elements $A_1$ and $B_1$ are orthogonal, and if the scale of the distance is
normalized with a factor of $\sqrt{2\pi}$, then $A_1$ and $B_1$ will also be an
orthonormal pair. The same argument applies to other functions listed in Eq.
(9-44). That is, for any two points $A_k$ and $B_k$ representing $(\cos kx)/\sqrt{\pi}$
and $(\sin kx)/\sqrt{\pi}$, respectively, we have

$$ \langle A_k,A_h\rangle = 0 \qquad k \ne h \tag{9-43} $$

Thus, every point of the space can be generated in a unique way by a
linear combination of the elements of its basis: $1$, $\sqrt2\cos x$, $\sqrt2\sin x$,
$\sqrt2\cos 2x$, $\sqrt2\sin 2x$, ....

At times, it is more convenient to employ directly the functional space
concept, that is, to consider $f(t)$ as a point $f$ in a Hilbert space without
regard to its Fourier expansion. For instance, for all signals in $L^2$ defined
in $[-\pi,\pi]$, we may accept the square of the norm as $\int_{-\pi}^{\pi} |f(t)|^2\,dt$. A basis
for this space is

$$ \frac{1}{\sqrt{2\pi}},\ \frac{\cos x}{\sqrt{\pi}},\ \frac{\sin x}{\sqrt{\pi}},\ \frac{\cos 2x}{\sqrt{\pi}},\ \frac{\sin 2x}{\sqrt{\pi}},\ \ldots \tag{9-44} $$
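
That the functions of Eq. (9-44) are orthonormal with respect to the inner product $\int_{-\pi}^{\pi} f(x)g(x)\,dx$ can be confirmed numerically. The following Python sketch is an added illustration, not from the original text; it uses a simple midpoint rule, and the step count, the number of harmonics, and the function names are assumptions.

```python
import math

def integrate(g, a=-math.pi, b=math.pi, steps=2000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

def basis(n_harmonics):
    """First 2*n_harmonics + 1 functions of Eq. (9-44)."""
    fs = [lambda x: 1.0 / math.sqrt(2.0 * math.pi)]
    for k in range(1, n_harmonics + 1):
        fs.append(lambda x, k=k: math.cos(k * x) / math.sqrt(math.pi))
        fs.append(lambda x, k=k: math.sin(k * x) / math.sqrt(math.pi))
    return fs

def gram(fs):
    """Matrix of inner products <f_i, f_j>; the identity for an orthonormal set."""
    return [[integrate(lambda x: f(x) * g(x)) for g in fs] for f in fs]

if __name__ == "__main__":
    for row in gram(basis(2)):
        print([round(v, 6) for v in row])
```

For an orthonormal set the printed matrix is the identity, up to rounding.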


An interpretation of the foregoing material in terms of electrical signals
is in order. If $f_1$ and $f_2$ are two electric time signals expressible in terms
of a Fourier series, $f_1, f_2 \in L^2$ in the time interval $[-T/2,T/2]$, and $K$ is a
real constant, then the following analogy between the class of $L^2$ signals
and the corresponding signal space is instructive. (By signal space we
mean the function space pertinent to the class of signals under
consideration.) Table 9-1 will bring into focus some of this analogy.

9-10. Band-limited Signal Space. In Sec. 9-6 we derived the sampling
theorem. In the light of that material, it is instructive to look at the
class of band-limited functions (band limit $|\omega_0|$) which are square-integrable. We
shall denote this class of function by $BL^2$ and consider them in signal
space $R^\infty$. The norm is defined as before. It can be shown that all the
requirements of a normed space are fulfilled.





Table 9-1

Time signals in $L^2$ possessing convergent Fourier series expansions | Vector space of the signal
----------------------------------------------------------------------|---------------------------
$f(t)$ | Point $f$
Average power dissipated by signal $f(t)$ in the unit resistor in time $[-T/2,T/2]$ | $T^{-1}$ times the square of the length of $f$
$f_2(t) = Kf_1(t)$. Power is multiplied by $K^2$ | Length of $f_2$ = $K$ times the length of $f_1$
$\int f_1(t)f_2(t)\,dt = 0$; Power $(f_1 + f_2)$ = Power $f_1$ + Power $f_2$ | $f_1$ and $f_2$ are orthogonal elements
All signals with average power $P$ in time $T$ | All points on the sphere with the radius $\sqrt{TP}$ and center at the origin
rms power associated with $f_1(t) + f_2(t)$ cannot be more than the sum of the individual rms powers of $f_1(t)$ and $f_2(t)$ | $\|f_1 + f_2\| \le \|f_1\| + \|f_2\|$


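Let us consider the points representing the functions

$$ f_1(t) = \frac{\sin(\omega_0 t - \pi n)}{\omega_0 t - \pi n} \qquad f_2(t) = \frac{\sin(\omega_0 t - \pi m)}{\omega_0 t - \pi m} \tag{9-45} $$

These functions are orthogonal elements; in fact,

$$ \langle f_1,f_2\rangle = \int_{-\infty}^{\infty}\frac{\sin(\omega_0 t - \pi n)}{\omega_0 t - \pi n}\,\frac{\sin(\omega_0 t - \pi m)}{\omega_0 t - \pi m}\,dt = \begin{cases} 0 & \text{if } m \ne n \\[4pt] \dfrac{\pi}{\omega_0} & \text{for } m = n \end{cases} \tag{9-46} $$

Next, consider a function $f(t) \in BL^2$.

$$ f(t) = \sum_{n=-\infty}^{\infty} x_n\,\frac{\sin(\omega_0 t - n\pi)}{\omega_0 t - n\pi} \qquad \text{where } x_n = f\!\left(\frac{n\pi}{\omega_0}\right) $$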



Each signal $f(t) \in BL^2$ has a point representation in our function space.
The power content of the signal, that is, the power dissipated in infinite
time $[-\infty,+\infty]$ in a unit resistor under the effect of a voltage $f(t)$, is

$$ \int_{-\infty}^{\infty} [f(t)]^2\,dt = \|f\|^2 = \int_{-\infty}^{\infty}\left[\sum_{n=-\infty}^{\infty} x_n\,\frac{\sin(\omega_0 t - n\pi)}{\omega_0 t - n\pi}\right]^2 dt \tag{9-47} $$

Application of Eq. (9-46) shows that

$$ \|f\|^2 = \frac{\pi}{\omega_0}\sum_{n=-\infty}^{\infty} x_n^2 \tag{9-48} $$

Thus, the electric power of the signal $f(t)$ is numerically equal to the
square of the distance of the corresponding element $f$. The class of
signals with specified power content $P$ and in $BL^2$ corresponds to the points
on the surface of the sphere of radius $\sqrt{P}$ centered at the origin.
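
Equation (9-48) can be tried out on the signal of Example 9-1, whose samples are known in closed form. The Python sketch below is an added illustration, not from the original text. It assumes K = 1 and ω0 = 2π, truncates the sum at a hypothetical `n_max`, and compares the result with the direct integral of $f^2$, which is evaluated with the standard result $\int_{-\infty}^{\infty} (\sin x/x)^4\,dx = 2\pi/3$.

```python
import math

def sample(n, K=1.0, w0=2.0 * math.pi):
    """x_n = f(n*pi/w0) for f(t) = (K*w0/2pi) * [sin(w0*t/2)/(w0*t/2)]**2."""
    if n == 0:
        s = 1.0
    else:
        x = n * math.pi / 2.0
        s = math.sin(x) / x
    return K * w0 / (2.0 * math.pi) * s * s

def power_from_samples(K=1.0, w0=2.0 * math.pi, n_max=2000):
    """Right-hand side of Eq. (9-48), truncated to |n| <= n_max."""
    total = sum(sample(n, K, w0) ** 2 for n in range(-n_max, n_max + 1))
    return math.pi / w0 * total

def power_from_integral(K=1.0, w0=2.0 * math.pi):
    """Integral of f(t)**2 over all t, in closed form: K**2 * w0 / (3*pi)."""
    return K * K * w0 / (3.0 * math.pi)

if __name__ == "__main__":
    print(power_from_samples(), power_from_integral())
```

With the assumed values both expressions equal 2/3, so the squared length of the sample vector, scaled by π/ω0, reproduces the power content of the signal.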

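The ordinary algebraic operations on members of $BL^2$ are self-evident.
For instance,

$$ f_1(t) \rightarrow f_1 = [\ldots,x_{-1},x_0,x_1,\ldots] $$

$$ f_2(t) \rightarrow f_2 = [\ldots,y_{-1},y_0,y_1,\ldots] $$

$$ [f_1(t) + f_2(t)] \rightarrow f_1 + f_2 = [\ldots,\ x_{-1} + y_{-1},\ x_0 + y_0,\ x_1 + y_1,\ \ldots] \tag{9-49} $$

$$ \int_{-\infty}^{\infty} f_1(t)\,f_2(t)\,dt \rightarrow \langle f_1,f_2\rangle = \frac{\pi}{\omega_0}\sum_{i=-\infty}^{\infty} x_i y_i $$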

The concept of Table 9-1, of course, holds for the space of band-limited
signals. To sum up, through the use of the sampling theorem,
we have been able to transfer the study of the class of band-limited
signals to the study of similar problems in an infinite-dimensional space.
Furthermore, if we also inject the qualification of randomness, we shall
be able to consider a multidimensional random variable $X$ in lieu of a
class of continuous signals.

[Figure: A channel with an additive noise.]

9-11. Band-limited Ensembles. The change of framework from
randomly defined continuous signals to infinite-dimensional variates
has somewhat simplified the problem and provided a suitable physical
interpretation. However, the simplification is not yet adequate. In
fact, we are still faced with an infinite-dimensional random variable, or
what is generally referred to as a stochastic process. The study of such
processes will be taken up in Chaps. 10 and 11. Thus, if we wish to rely
only on our acquired background of finite-dimensional random variables,
some further simplification will be required. To this end, Shannon
considers only those signals in $BL^2$ such that their power content is
"principally" contained in a finite time interval [say, $-(T/2)$ to $+(T/2)$].

We consider the class of time functions $f(t)$ which are band-limited,
in the range $-W$ to $+W$ cycles per second. According to the sampling
theorem, each member of this class is fully determined by its values at
the sampling points $1/2W$ apart.

$$ f(t) = \sum_{n=-\infty}^{\infty} f\!\left(\frac{n}{2W}\right)\frac{\sin\pi(2Wt - n)}{\pi(2Wt - n)} \tag{9-50} $$

Furthermore, we shall make the "practical" assumption that $f(t)$ is
negligible outside a time interval $(-T/2,T/2)$, $T$ being a large integer.
We have referred to this assumption as a practical one. Indeed, it is
reasonable in the evaluation of $f(t)$ to stop at a place where
the summation of Eq. (9-50) has negligible terms. However, mathematically,
it is impossible to reconcile the idea of a function being limited
in both the frequency and the time domain. The difficulties in this
concept can be traced to the principle of uncertainty in the work of
Gabor and others (see N-3 in the Appendix).

Fig. 9-6. An illustration of band-limited signals with specified power contents in
the signal space. X is a transmitted signal, N is noise, and Y is the received signal.
$\|X\| = \sqrt{2WTS}$, $\|N\| = \sqrt{2WTN}$, $\|Y\| = \sqrt{2WT(S + N)}$.

An interesting observation can be made with respect to frequency
band-limited signals that are limited to the time interval $(-T/2,+T/2)$.
Such signals are represented by points in a $2WT$-dimensional space.
The average power $S$ associated with a typical signal (that is, the power
dissipated in time $T$ by a voltage-stimulated signal applied to a unit
resistor) is

$$ S = \text{average power in time } T = \frac{1}{T}\int_{-T/2}^{T/2} f^2(t)\,dt = \frac{1}{T}\,\frac{1}{2W}\sum_{n=1}^{2WT} x_n^2 \tag{9-51} $$

where $x_n = f(n/2W)$ are the sample values.





Then we have for the length or the norm of the associated vector

$$ d = \|f\| = \sqrt{2WTS} \tag{9-52} $$

All signals whose power content is less than $S$ are represented by points in
the $2WT$-dimensional space within a sphere of radius $d$.

To sum up, we have established a means of studying a reasonably
general class of signals which have a preassigned power content by studying
their representative points of $2WT$-dimensional space. Note that our
present confinement to band- and time-limited signals provides only the
convenience of dealing with finite-dimensional spaces. This restriction
can be removed if one tolerates the use of spaces with an infinite number
of dimensions. As long as the integral of Eq. (9-51) converges, the
concept of distance holds and our model can be used.

Now assume that a member of this class of band-limited signals with
specified power $S$ is applied to a noisy channel. Let the noise also be a
member of this type of time functions but with a power content $N$. Then
the output signal has a power content that satisfies the triangle inequality
of Eq. (9-53),

$$ \|x + z\| \le \|x\| + \|z\| \tag{9-53} $$

(for independent signal and noise the powers add, so that
$\|x + z\|^2 = \|x\|^2 + \|z\|^2$). That is, the representative point for the
output signal remains on the sphere of radius $r$ centered at the origin;

$$ r = \sqrt{2WT(S + N)} \tag{9-54} $$

Let $Y$ be a received signal point; any point on a noise sphere centered
at $Y$ could be considered as a possible original signal. However, if there
is only one possible signal near $Y$ and listed in the transmission vocabulary,
there will be no error in the decoding. Thus, care should be applied
in the selection of the transmission signals. If the latter signals have
some reasonable mutual distance, the effect of noise perturbation in
decoding will not be too serious. A heuristic estimate of the size of the
transmission alphabet can be obtained by calculating the largest number
of points on the sphere with a mutual distance of $\sqrt{2WTN}$. When the
number of dimensions becomes very large, the volume of the sphere lies
very close to its surface and the ratio of the volume of the two spheres
gives an estimate of the largest possible number of distinguishable signals.
That is,

$$ M \approx \frac{\text{vol sphere with rad } \sqrt{2WT(S + N)}}{\text{vol sphere with rad } \sqrt{2WTN}} \tag{9-55} $$





The volume of an n-dimensional sphere with radius $R$ is of the form

$$ V_n = \frac{\pi^{n/2}}{\Gamma(n/2 + 1)}\,R^n $$

where

$$ \Gamma(x) = \int_0^{\infty} t^{x-1}e^{-t}\,dt $$

(See D. M. Y. Sommerville, "An Introduction to the Geometry of
N Dimensions," p. 135, E. P. Dutton & Co., Inc., New York, 1929;
and S. Goldman, "Information Theory," Chap. 6, Prentice-Hall, Inc.,
Englewood Cliffs, N.J., 1953.)

Therefore, we find

$$ C_t = \lim_{T\to\infty}\frac{\log M}{T} = W\log\left(1 + \frac{S}{N}\right) \tag{9-56} $$
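
The step from the volume ratio of Eq. (9-55) to Eq. (9-56) can be traced numerically. In the Python sketch below (an added illustration, not part of the original text) the logarithm of the sphere volume is computed with `math.lgamma` to avoid overflow at large $n = 2WT$; the function names and the sample figures W = 3000 cycles per second, S/N = 1000 are assumptions. Logarithms are taken to base 2.

```python
import math

def log_sphere_volume(n, radius):
    """Natural log of V_n(R) = pi**(n/2) * R**n / Gamma(n/2 + 1)."""
    return (0.5 * n * math.log(math.pi)
            + n * math.log(radius)
            - math.lgamma(0.5 * n + 1.0))

def rate_estimate(W, T, S, N):
    """(1/T) * log2(M), with M the ratio of sphere volumes of Eq. (9-55)."""
    n = 2 * W * T                                   # number of dimensions
    log_m = (log_sphere_volume(n, math.sqrt(n * (S + N)))
             - log_sphere_volume(n, math.sqrt(n * N)))
    return log_m / (T * math.log(2.0))

def capacity(W, S, N):
    """Right-hand side of Eq. (9-56) in bits per second."""
    return W * math.log2(1.0 + S / N)

if __name__ == "__main__":
    for T in (1, 10, 100):
        print(T, rate_estimate(3000, T, 1000.0, 1.0), capacity(3000, 1000.0, 1.0))
```

Since both volumes carry the same factor $\pi^{n/2}/\Gamma(n/2+1)$, the ratio reduces to $[(S+N)/N]^{WT}$ and the estimate is the same for every $T$ in this simplified model.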


In the next section we employ this geometric interpretation for the 
determination of the entropies of this particular class of signals in the 
signal space. 

From the probability point of view, when the time limitation is removed, 
we are actually dealing with a class of signals that form a stochastic 
process. The entropy of such processes will be discussed in Chap. 11. 
For the time being, we deal with time-limited ensembles. That is, based 
on our simplified model, the signal and the noise are multidimensional 
random variables. 


$$ \{s(t)\} \rightarrow [X_1,X_2,\ldots,X_n] \qquad \{n(t)\} \rightarrow [Z_1,Z_2,\ldots,Z_n] \tag{9-57} $$

The random variables $X_k$ and $Z_k$ are defined at the same time, each with
specified probability distributions.

9-12. Entropies of Band-limited Ensemble in Signal Space. Pursuant
to the material of the preceding section consider a point $\bar{X}$ in the signal
space:

$$ \bar{X}:\ X_1, X_2, \ldots, X_n $$

We use the notation $\bar{X}$ to indicate a point in the n-dimensional space.
Similarly, the notation $\bar{Y}$ represents a point in another n-dimensional
space pertinent to output signals.

$$ \bar{Y}:\ Y_1, Y_2, \ldots, Y_n $$

For convenience we may consider a 2n-dimensional space describing
the behavior of the multidimensional random variable $(\bar{X},\bar{Y})$. This is,
in a way, similar to our ordinary two-dimensional random variables.
In a 2n-dimensional product space we have five main probability density
functions of interest and, consequently, five main entropies. These
densities and entropies can be symbolically represented by

$$ \begin{array}{ll} f(\bar{x},\bar{y}) & H(\bar{X},\bar{Y}) \\ f(\bar{x}) & H(\bar{X}) \\ f(\bar{y}) & H(\bar{Y}) \\ f(\bar{y}\,|\,\bar{x}) & H(\bar{Y}\,|\,\bar{X}) \\ f(\bar{x}\,|\,\bar{y}) & H(\bar{X}\,|\,\bar{Y}) \end{array} \tag{9-58} $$

The rate of information transmission in the signal space is given by

$$ I(\bar{X};\bar{Y}) = \int\!\!\int f(\bar{x},\bar{y})\log\frac{f(\bar{x},\bar{y})}{f(\bar{x})\,f(\bar{y})}\,d\bar{x}\,d\bar{y} \tag{9-59} $$


In the above symbolic vector presentation we are simply extending our
definitions from a one- or a two-dimensional space to an n- or a
2n-dimensional signal space. The meaning of this symbolic notation
has already been described in this section. The details are left to the
reader for full justification and comprehension. Now once more,
as in Sec. 9-4, the problem is reduced to developing a formula for
the maximum rate of information transmission.

Fig. 9-7. The sequence of random variables $[a_{k1},a_{k2},\ldots,a_{kp}]$ represents a
point in the signal space for the original continuous random signal.

Such a development can be achieved in a direct way. For example, if
we make the further assumptions that


= 0 


= 0 


(Yxif (Yx 


k = 1, 2, . . . , n          (9-60)


Noise has independent n-dimensional normal distribution.

Output = input + noise: $\bar{Y} = \bar{X} + \bar{Z}$

Then the direct application of the method of Sec. 9-4 will lead to Shannon's
celebrated formula

$$ \text{max transinformation} = W\log\left(1 + \frac{S}{N}\right) $$

It is only under the listed sequence of assumptions that the formula holds.
If we do not wish to confine ourselves to any particular type of noise, we
still can use the gaussian noise as a basis of comparison. Such a
comparison is discussed in Shannon (I and II).
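
For the gaussian case the formula can be reproduced from entropies alone. The Python sketch below is an added illustration and not necessarily the route taken in Sec. 9-4: it assumes independent zero-mean gaussian signal and noise samples with variances S and N, uses the standard expression $\tfrac12\log_2(2\pi e\sigma^2)$ for the entropy of a gaussian variable, and counts 2W independent samples per second. The function names are hypothetical.

```python
import math

def gaussian_entropy_bits(variance):
    """Differential entropy of a gaussian variable, in bits."""
    return 0.5 * math.log2(2.0 * math.pi * math.e * variance)

def transinformation_per_sample(S, N):
    """H(Y) - H(Y|X) for Y = X + Z, with X and Z independent gaussians."""
    return gaussian_entropy_bits(S + N) - gaussian_entropy_bits(N)

def max_rate(W, S, N):
    """2W samples per second, each carrying the per-sample transinformation."""
    return 2.0 * W * transinformation_per_sample(S, N)

if __name__ == "__main__":
    W, S, N = 3000.0, 1000.0, 1.0                    # illustrative values only
    print(max_rate(W, S, N), W * math.log2(1.0 + S / N))
```

The two printed numbers coincide, since $H(Y) - H(Z) = \tfrac12\log_2[(S+N)/N]$ per sample.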






9-13. A Mathematical Model for Communication of Continuous Signals.
In the previous section we ascertained that for band-limited
continuous channels without memory, under certain plausible conditions,
the maximum rate of transinformation is given by Eq. (9-1). The proof,
so far, is an existence proof rather than a constructive one, since it does
not present a method for transmitting information at the ideal rate. The
following geometric model suggested by Shannon (III) is aimed at
providing a more general proof for the possibility of encoding and decoding
continuous signals for transmission over noisy memoryless channels at a
rate as close to the ideal rate as desired. Shannon, under some general
conditions, derives some bounds for the average probability of error for
the channel. As a result, one is able to give a geometric proof of the
second fundamental theorem for a class of continuous memoryless
channels.

Fig. 9-8. Quantized values of each word are transmitted in lieu of continuous signals.

Figure 9-8 exhibits a number of band-limited signals which constitute
the messages to be transmitted.

The following model has been suggested by Shannon.

Source. Let $T$ be the set of integers and $S$ the set of real numbers.
At every instant $t \in T$, the source selects a signal $s \in S$ with some
prespecified probability. Actually the source transmits words of a code
book as discussed below.

Block Code. A block code consists of $M$ band-limited words $w_1$, $w_2$,
. . . , $w_M$. Each word consists of $n$ letters from $S$, that is,

$$ w_k = [s_{k1},s_{k2},\ldots,s_{kn}] $$

where the ordinates $s_{ki}$ are chosen at the proper sampling intervals (Fig.
9-8). For simplicity, suppose that the sampling terms beyond the above
$n$ terms are negligible. In order further to simplify the model, we make
the tentative assumption that all words of the code book are equally
probable, i.e.,

$$ P\{w_k\} = \frac{1}{M} \qquad k = 1,2,\ldots,M \tag{9-61} $$

Channel. The channel is assumed to be of a continuous type with
additive noise. A transmitted letter $s_{ki}$ will be received as

$$ s_{ki} + x_{ki} \tag{9-62} $$

The noise $x_{ki}$ is a random variable with specified probability distribution.
Here we assume that the noise has a gaussian distribution centered about
the value of the transmitted letter. From a physical point of view, this
assumption is quite reasonable (see Fig. 8-3).


Fig. 9-9. A decoding scheme: All words received in the region $V_k$ will be decoded as $w_k$.

Decoding Scheme. If the transmitter has $M$ words, we may partition
the receiving signal space into $M$ disjoint regions such that each $w_k$
corresponds to a well-defined region $V_k$. If the decoding is to be an
intelligent one, it must be devised with an eye to reducing the "probability
of error" in a certain sense. In other words, the partitioning of the
receiving universe should not be done at random. Let $P_{ei}$ be the error in
the decoding of the word $w_i$, that is, the probability of transmitting $w_i$
and receiving a point $v$ not in $V_i$.

$$ P_{ei} = P\{v \in V_i' \mid w_i\} \tag{9-63} $$

where $V_i'$ denotes the complement of $V_i$. The average error probability
for the code book is

$$ E = P_e = \sum_{i=1}^{M} P\{w_i\}\,P_{ei} = \frac{1}{M}\sum_{i=1}^{M} P_{ei} \tag{9-64} $$

9-14. Optimal Decoding. An optimal decoding procedure is the 
decision scheme for the partitioning of the receiving universe lefxding to a 
niinimum of E under some assumed constraints. The constraints 
assumed here are those suggested in the previous section (signals with 
specified power content, additive gaussian noise, etc.). Reference is 
made to the geometric presentation of each word in an n-dimensional 





324 


CONTINUUM WITHOUT MEMORY 


vector space. Each transmitted word $w_k$ has a norm (distance from the
origin) $r_k$. An optimal decision scheme consists of the following: If
a word has been received with a point representation, say $v$, in the signal
space, we compare its distance to the points representing the permissible
transmitted words, and assume that it corresponds to the closest such
point. This procedure requires the partitioning of the n-dimensional
space into $M$ disjoint regions in such a manner that to each region $V_k$ are
assigned all those signal points that are closer to $w_k$ than to any other
point $w_j$. If a signal point is at an equal distance from two or more points
of the $\{w_k\}$ set, we may assign it arbitrarily to any one of the associated
regions. It remains to show that this decision scheme corresponds to an
optimal decoding procedure. In fact, for independent gaussian distribution
in n-dimensional space, we have

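$f(x_1, \ldots, x_n)\, dx_1 \cdots dx_n = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{k=1}^{n} x_k^2\right) dx_1\, dx_2 \cdots dx_n$    (9-65)

[This can be seen as a special case of Eqs. (5-84) and (7-29).]

$P_{ei} = (2\pi\sigma^2)^{-n/2} \int \cdots \int \exp\left(-\frac{r^2}{2\sigma^2}\right) dx_1 \cdots dx_n$    (9-66)

the integral being taken over $V_i'$, the complement of $V_i$, with $r$ the distance
of the point of integration from $w_i$.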

Now suppose that we compute two terms, say $P_{ei} + P_{ej}$, first for this
decision scheme and then for any other decision scheme. For the latter
scheme, consider the following decoding procedure: Assume that a signal
point $A$ which is closest to $w_j \in V_j$ should be assigned to the region $V_i$
instead of $V_j$. Since $r_{Aj} < r_{Ai}$ and the gaussian probability distribution
is a monotonically decreasing function of $r$, according to Eq. (9-67) one can
see that the above decision scheme leads to a lower $E$.

$\frac{1}{M} (P'_{ei} + P'_{ej}) > \frac{1}{M} (P_{ei} + P_{ej})$    (9-67)

Here the primed terms refer to the alternative scheme.

This reasoning can be further extended in order to show that the above
decoding scheme (also called maximum-likelihood detector) is an optimal
decoding procedure [see Gilbert (I)]. An example of this partitioning is
suggested in Fig. 9-10. If we had only three signal points $U_1$, $U_2$, and
$U_3$ in the signal space, the three would determine the regions $V_1$, $V_2$, and
$V_3$. Note that, if one makes the further assumption that all the signal
points (in n-dimensional space) have equal power, then the maximum-likelihood
decoding regions will consist of n-dimensional polyhedra
with apexes at the origin of the coordinates. Each polyhedron is bounded
by $\mu$ hyperplanes [$(n - 1)$-dimensional, $\mu \le M - 1$] (Fig. 9-11 for three-
dimensional space).

Fig. 9-10. Minimum-distance decoder: All received words closest to $U_k$ are
decoded as $w_k$.
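The minimum-distance rule described above can be tried numerically. The following is only a sketch under stated assumptions: the three equal-power signal points, the noise levels, and the function names are hypothetical choices made for illustration, and the error probability is estimated by simulation rather than computed from Eq. (9-66).

```python
import math
import random

def nearest_codeword(received, codewords):
    """Return the index k of the code point closest (Euclidean) to `received`."""
    best_k, best_d2 = 0, float("inf")
    for k, w in enumerate(codewords):
        d2 = sum((r - c) ** 2 for r, c in zip(received, w))
        if d2 < best_d2:  # ties keep the first index, an arbitrary choice
            best_k, best_d2 = k, d2
    return best_k

def estimate_error(codewords, sigma, trials=2000, seed=0):
    """Monte Carlo estimate of the average error probability E
    for equiprobable words and additive gaussian noise."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        i = rng.randrange(len(codewords))  # transmitted word, chosen uniformly
        v = [c + rng.gauss(0.0, sigma) for c in codewords[i]]  # received point
        if nearest_codeword(v, codewords) != i:
            errors += 1
    return errors / trials

# Three hypothetical equal-power signal points in the plane (illustrative only).
CODEWORDS = [(math.cos(a), math.sin(a))
             for a in (0.0, 2 * math.pi / 3, 4 * math.pi / 3)]
```

With equal-power points the decision regions are wedges with apexes at the origin, so the estimated error grows as the assumed noise standard deviation `sigma` is increased.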

An Encoding Problem. We have now established a working model of a
continuous channel. For instance, if it is desired to transmit waveforms
selected from a finite set of band-limited continuous signals, each word
$w_k$ may be chosen as the vector representing the totality of the sampled
values of the $k$th signal. The central problem for this communication
model is to devise suitable codes, that is, a decision scheme that minimizes
$E$ subject to certain plausible constraints. If no additional constraints
were imposed, then the best solution would consist of some sort
of equidistant placement of $M$ points in an n-dimensional space.

Most of the practically useful and theoretically interesting constraints 
stem from the fact that the original signals must have a limited power 
content. The following set of three hypotheses was considered by 
Shannon. 

1. All signals (words) have the same power content P. 

2. The power content of signals (words) is smaller than or equal to a
specified power $P$.

3. The average power content of all signals (words) is smaller than or
equal to a specified $P$.

Thus the corresponding encoding problem is to minimize $E$ by proper
placement of $M$ points in the n-dimensional space with the following
constraints:

1. $M$ points on the sphere centered at the origin with radius $\sqrt{nP}$

2. $M$ points within or on the sphere centered at the origin with radius
$\sqrt{nP}$

3. $M$ points such that their average squared distance to the origin is
$nP$
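As an illustration of constraint 1, the sketch below places $M$ points at random on the sphere of radius $\sqrt{nP}$ and checks that each word has power content $P$. It is not an optimal placement; the function names and the random construction (normalizing a gaussian vector) are assumptions made for this example.

```python
import math
import random

def random_code_on_sphere(M, n, P, seed=0):
    """Draw M points on the sphere of radius sqrt(n*P) in n dimensions."""
    rng = random.Random(seed)
    radius = math.sqrt(n * P)
    code = []
    for _ in range(M):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]  # isotropic random direction
        norm = math.sqrt(sum(x * x for x in g))
        code.append([radius * x / norm for x in g])
    return code

def power(word):
    """Average power per coordinate: (1/n) * sum of squared coordinates."""
    return sum(x * x for x in word) / len(word)
```

A code that satisfies constraint 1 also satisfies constraints 2 and 3, since every word then has power exactly $P$.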

9-15. A Lower Bound for the Probability of Error. We focus our attention,
along with Shannon, on the communication of continuous signals in
the presence of additive noise, when the code words lie on a sphere of
radius $\sqrt{nP}$ (case 1) centered at the origin. Let $N$ be the variance of
noise at each sampling point, $n$ the number of sampling points, and $M$ the
number of code words. Familiarity with the material of Chap. 8 makes it
clear that, for an optimum encoding, the average error probability will
be a function of $M$, $n$, and $P/N$. Because of the additivity of noise,
only the ratio of $P$ to $N$ will enter the picture. Shannon's
procedure for obtaining some bounds of $E$ can be described in terms
of a geometric model. The details of his derivations require much more






space than that available here. We shall quote the method of attack 
and some of the results obtained. For details of the proof the reader is 
referred to the original reference [Shannon (III)]. The suggested method 
for obtaining a lower bound for the error probability is rather interesting. 
It is based on the following two direct steps: 

1. Consider the error associated with this minimum-distance decod- 
ing scheme, that is, when the signal space is divided into appropriate 
polyhedra. Then evaluate the corresponding probability of error. 

2. The exact evaluation of the error probability for the above geometric
model seems to be a complex problem. Therefore, one may approximate
the probability of error for this scheme by comparing it with the error in
any similar model which may be subjected to simpler computation.

The probability of error associated with a decoding scheme, for a specific
word, is equal to the product of $P\{w_i\}$ and $P$\{received word not in the
$i$th polyhedron\}. In our adopted model, $P\{w_i\} = 1/M$, and the other
probability is rather difficult to compute. The following basis for
comparison may be established. Consider a two-dimensional monotonically
decreasing probability density function $f(r)$ with $f(\infty) = 0$. Compare
the probability of the random variable $R$ assuming values in a circle $C$
centered at the origin with the probability of the random variable assuming
values in a polygon $G$ of the same area (see Fig. 9-11). Let $A$ stand for
the common area between $C$ and $G$; then



\ 


Fig. 9-11. Comparison of probability within a circle and a polygon of equal
area for a monotonically decreasing probability density.

$P\{R \in C\} = P\{R \in A\} + P\{R \in C \cap R \notin G\}$    (9-68)


Now, if we compare the probability associated with any two elements of 
equal area, one in the polygon but not in the circle and the other in the 
circle but not in the polygon, we shall conclude that a smaller probability 
is associated with the former element. Thus 


$P\{(R \in G) \cap (R \notin C)\} < P\{(R \in C) \cap (R \notin G)\}$    (9-69)
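The comparison between a circle and a polygon of equal area can be checked in closed form for one particular case. The sketch below assumes a two-dimensional standard gaussian density (unit variance per coordinate) and takes the polygon to be a square centered at the origin; both choices, and the function names, are illustrative assumptions.

```python
import math

def prob_in_circle(area):
    """P{R in C} for a circle of the given area centered at the origin:
    1 - exp(-r^2 / 2), with r^2 = area / pi."""
    r2 = area / math.pi
    return 1.0 - math.exp(-r2 / 2.0)

def prob_in_square(area):
    """P{R in G} for a centered square of the same area: the two coordinates
    are independent, so the probability is the square of P{|X| < side/2}."""
    half_side = math.sqrt(area) / 2.0
    p_one_coordinate = math.erf(half_side / math.sqrt(2.0))
    return p_one_coordinate * p_one_coordinate
```

For every area tried, the circle collects more probability than the square, in agreement with the argument leading to Eq. (9-69).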

This reasoning can be extended to an analogous situation in the n-dimensional
space. We may compare the probability of a signal point being
in the n-dimensional polyhedron (or pyramid) with the similar probability
for a right-angle n-dimensional cone, both with solid angle $\Omega_i$. ($\Omega_i$ is the
area cut by the pyramid or the cone on the unit n-dimensional sphere.
Both the pyramid and the cone have their apexes at the origin.) The
signal point $w_i$ lies on the axis of the cone at a distance $\sqrt{nP}$. Thus,
we arrive at a practical method for obtaining a lower bound for the probability
of error, as the computation of the latter probability can be directly
accomplished.


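$E \ge \frac{1}{M} \sum_{i=1}^{M} P\{\text{signal being moved outside } i\text{th cone}\} = \frac{1}{M} \sum_{i=1}^{M} F(\Omega_i)$

The probability function $F(\Omega)$ is a monotonically decreasing function of
distance and also convex in the n-dimensional space. Thus

$E \ge F(\Omega_0)$    where    $\Omega_0 = \frac{1}{M} \sum_{i=1}^{M} \Omega_i$    (9-70)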


Shannon finds it easier to compute the probability of a signal point
being displaced to outside a cone of half angle $\theta$ than to designate
the cone by its solid angle. He refers to this probability function by
$Q(\theta)$. Having this new variable $\theta$ in mind, we find

$E \ge F = Q(\theta_1)$    (9-71)

where $\theta_1$ corresponds to the solid angle $(1/M)\Omega(\pi)$, that is,

$\Omega(\theta_1) = \frac{1}{M}\, \Omega(\pi)$

The actual computation of $Q(\theta)$ turns out to be quite long and more
complex than can be presented in a few pages. In Sec. 9-17, we shall
quote Shannon's results without taking the space for their derivations.

Fig. 9-12. A schematic diagram of the minimum-distance decoding pyramid in n-space.

9-16. An Upper Bound for the Probability of Error. We have pointed
out more than once that the proper selection of the words in the code
book is a most important factor in reducing the probability of error. In
this section it will be shown that, even if the code-points are selected at
random, the average probability of error of the code cannot surpass some
limiting value. The evaluation of this bound is the subject of the present
section.

Consider a circular cone with half angle $\theta$ about a received word $v_k$
corresponding to a transmitted word $w_k$. If the cone does not surround
any signal point, the word will be unambiguously detected as $w_k$. However,
if there are other signal points in this cone, they may be incorrectly
decoded as the original message. The probability of a code-point being
in this cone is the ratio of the solid angle of the cone to that of the total
space surrounding $v_k$, that is,

$P\{\text{one code-point inside cone}\} = \frac{\Omega(\theta)}{\Omega(\pi)}$    (9-72)


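Assuming the code-points are independently (at random) distributed on
the sphere of radius $\sqrt{nP}$, we find

$P\{M - 1 \text{ code-points in cone}\} = \left[\frac{\Omega(\theta)}{\Omega(\pi)}\right]^{M-1}$

$P\{\text{none of } M - 1 \text{ code-points in cone}\} = \left[1 - \frac{\Omega(\theta)}{\Omega(\pi)}\right]^{M-1}$    (9-73)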


Actually, in order to compute the average error probability for the decoding
scheme, we must first compute the probability of the transmitted
signal being displaced in the region between a cone of half angle $\theta$ and one
with half angle $\theta + d\theta$. This latter probability of the original signal
being displaced by noise has already been designated by the notation
$-dQ(\theta)$. Thus, the average probability of error for a random code is
given exactly by

$E = -\int_0^{\pi} \left\{1 - \left[1 - \frac{\Omega(\theta)}{\Omega(\pi)}\right]^{M-1}\right\} dQ(\theta)$    (9-74)


Some simplification will occur if this exact formula is somewhat weakened
by using the inequality $(1 - x)^n \ge 1 - nx$, which suggests

$E \le -(M - 1) \int_0^{\pi} \frac{\Omega(\theta)}{\Omega(\pi)}\, dQ(\theta)$    (9-75)
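The weakening step rests on the elementary inequality $(1 - x)^n \ge 1 - nx$ for $0 \le x \le 1$. The sketch below compares the exact term $1 - (1 - x)^{M-1}$ with its bound $(M - 1)x$ on a grid, where $x$ stands for the ratio $\Omega(\theta)/\Omega(\pi)$; the function names and the grid are illustrative assumptions.

```python
def exact_term(x, M):
    """1 - (1 - x)^(M-1): probability that at least one of M-1 points is in the cone."""
    return 1.0 - (1.0 - x) ** (M - 1)

def weakened_term(x, M):
    """(M - 1) * x: the simpler upper bound obtained from (1 - x)^n >= 1 - n*x."""
    return (M - 1) * x

def check_bound(M, steps=100):
    """Verify exact_term <= weakened_term on a grid of x values in [0, 1]."""
    return all(
        exact_term(i / steps, M) <= weakened_term(i / steps, M) + 1e-12
        for i in range(steps + 1)
    )
```

The bound is tight only for small $x$; for larger $x$ it exceeds 1, which is why the range of integration is afterwards split into two parts.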


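By dividing the range of integration into two parts $[0,\theta']$ and $[\theta',\pi]$, one
finds

$E \le -(M - 1) \int_0^{\theta'} \frac{\Omega(\theta)}{\Omega(\pi)}\, dQ(\theta) - \int_{\theta'}^{\pi} dQ(\theta)$    (9-76)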


Finally, we obtain an upper bound for the average probability of error
by assigning to the arbitrary angle $\theta'$ the value of the cone half angle
that corresponds to finding only one signal point in the cone, on an average.
That is, $\theta' = \theta_1$, where $\theta_1$ is defined by

$\Omega(\theta_1) = \frac{1}{M}\, \Omega(\pi)$

Thus, the average probability of error satisfies the inequalities

$Q(\theta_1) \le E \le Q(\theta_1) - \int_0^{\theta_1} \frac{\Omega(\theta)}{\Omega(\theta_1)}\, dQ(\theta)$    (9-77)

In the following section, we discuss the asymptotic behavior of this
error probability when $n$ is increased indefinitely.

9-17. Fundamental Theorem of Continuous Memoryless Channels in
Presence of Additive Noise. Using the inequalities of Eq. (9-77), we
wish to study the asymptotic behavior of the error probability when the
rate of transmission of information approaches the channel capacity.
Without going into Shannon's elaborate derivation of the probability
function $Q(\theta_1)$, we merely quote his result when $n$ asymptotically
approaches infinity. Shannon shows that if $n \to \infty$ and if the signaling
rate approaches the channel capacity (per degree of freedom), that is,

$R = \frac{1}{n} \log M \to \frac{1}{2} \log\left(1 + \frac{P}{N}\right) = C$

then the upper and the lower bound of Eq. (9-77) will coincide. In this
case he has shown that the probability of error approaches the CDF of a
standard normal distribution:

$E \approx \Phi\left[\sqrt{n}\, \sqrt{\frac{2P(P + N)}{N(P + 2N)}}\, (R - C)\right]$    (9-78)


Expression (9-78) exhibits Shannon's fundamental theorem for continuous
channels in the presence of additive gaussian noise. For any arbitrarily
small but positive $\epsilon = C - R$, the probability of error for the codes
approaches 0 as $n \to \infty$. From expression (9-78) it can also be noted
that, with a fixed positive $\epsilon$, it is not possible to encode messages such
that one may transmit at a rate greater than the channel capacity with
arbitrarily small error probability.
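The behavior described above can be tabulated numerically. The sketch below assumes that expression (9-78) reads $E \approx \Phi[\sqrt{n}\,\sqrt{2P(P+N)/N(P+2N)}\,(R - C)]$, with $\Phi$ the standard normal CDF and with $R$ and $C$ measured in natural units; the reading of the formula, the choice of units, and the function names are all assumptions of this example.

```python
import math

def capacity(P, N):
    """C = (1/2) * ln(1 + P/N), in natural units per degree of freedom."""
    return 0.5 * math.log(1.0 + P / N)

def std_normal_cdf(x):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def error_estimate(n, R, P, N):
    """Asymptotic error estimate near capacity under the assumed reading of (9-78)."""
    scale = math.sqrt(n * 2.0 * P * (P + N) / (N * (P + 2.0 * N)))
    return std_normal_cdf(scale * (R - capacity(P, N)))
```

For a rate slightly below $C$ the estimate falls toward 0 as $n$ grows, at $R = C$ it stays at 1/2, and for a rate slightly above $C$ it tends to 1.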

Note. An early conception of this idea appears in Shannon's article in
the 1949 Proceedings of the IRE. The treatment in the recent (1959)
Bell System Technical Journal article is quite elaborate. In the preceding
pages we have tried to present Shannon's basic method of proof
without recourse to the more complex techniques initiated in that paper.
The omission was felt necessary in order to remain within the bounds
of an elementary presentation. The reader who wishes to appreciate
fully Shannon's techniques should consult the original paper.





The following supplementary articles for this chapter are suggested:
Gilbert [I], Rice, and Thomasian. The first article makes use of the
concept of distance and power for the transmission of a class of band-limited
signals in the signal space. The second article, which historically
is the closest one to Shannon [II], contains several interesting results.
For instance, it is shown that the "reliability" of the transmission of
band-limited signals in the presence of independent noise when the transmission
rate is close to the channel capacity is approximately


H {C - Tty 

Shannon's equivalent result suggests a sharper estimate [Shannon (III,
p. 642)],


(P + NY^ 
P(P~+2N) 


(C - Ry 


The third afore-mentioned paper derives a relatively simple bound for the
probability of error. So far, the sharpest results (but not the simplest)
have been obtained in Shannon [III].

9-18. Thomasian's Estimate. A. J. Thomasian has recently made
a study of continuous channels in the presence of additive noise when
power contents of the signal and noise are limited (Sec. 9-15, case 2).
A summary of his results is given below.

Let

$U_k = [x_{k1}, x_{k2}, \ldots, x_{kn}]$

be a possible input word. We assume

$\frac{1}{n} \sum_{j=1}^{n} x_{kj}^2 \le P$

As before, we designate the corresponding output word by

$V_k = [y_{k1}, y_{k2}, \ldots, y_{kn}]$

If the average noise power per coordinate, $N > 0$, is specified for additive
independent gaussian noise with mean zero and variance $N$, the output of
the channel is constrained by

n 

- n vb w to- - 

The following version of the fundamental theorem for the above com- 
munication system (additive independent gaussian noise) is that of 
Thomasian. 




Theorem. Subject to the above assumptions, there exist $M$ distinct
input words $[u_1, u_2, \ldots, u_M]$


M < 

such that 


Ufc [^klf^k2j • • • 

k = 1, . . . ,M 


(9-80) 


M 

) <3 exp 

k = l 


n 0.8((7 - R)H r + N) 


where $C$ = channel capacity in bits per signal coordinate
$R$ = suitable rate of signaling; i.e., $0 < R < C = \log (1 + P/N)^{1/2}$
$D_k$ = output word decoded when input word is $U_k$ and optimal decoding
procedure is employed; $D_k'$ is set complement to $D_k$

This theorem gives an upper bound for the average error probability in
the described communication model. The decoding criterion is based on
minimum distance. For the sake of simplicity we have disregarded the
case where several words can be associated with a single transmitted
word. Although Thomasian's formulation of the problem is similar to
Shannon's approach, his derivation and proofs are quite different.
Thomasian's proof is based on some basic lemmas and inequalities similar
to those employed by Feinstein and Wolfowitz (see Chap. 12). A full
derivation of this interesting theorem requires more time and space
than is available at present. The interested reader is referred to the
original article.*


PROBLEMS 


9-1. A microphone does not let through sounds above 4,000 cycles per second.
Determine the sampling interval, allowing reconstruction of the output waveform
from sampled values.

9-2. Prove the sampling theorem for a time function $f(t)$ which is identically zero
outside the interval $a < t < b$.

9-3. (a) Show that the volume of an n-dimensional hypersphere of radius $R$ is


where 


f(R) = 


i; = R-f(R) 



n ft”-* exp (-{ftV2)l dft 


(b) For large $n$, plot the integrand in the denominator as a function of $R$.

* A. J. Thomasian, Error Bounds for Continuous Channels, Proc. Fourth International
Symposium on Information Theory, to be published in 1961. See also Blackwell,
Breiman, and Thomasian.





9-4. Compare the ratio of volumes of two hyperspheres of the same radius but
dimensions $n$ and $n - 2$, respectively.

9-5. Let $X = [X_1, X_2, \ldots, X_n]$ be a random vector with gaussian distribution:

Mean of $X_k = 0$
Moment $\overline{X_k X_j} = 0$    $k = 1, 2, \ldots, n$    $j \ne k$
Standard deviation of $X_k = \sigma$

$X$ may be represented by a point $X$ in an n-dimensional space.

(a) Given a point $X$ of this space with a distance $R$ from the origin, discuss the
relation between $R$ and the power content of the signal.

(b) What is the probability of having points representing signals of this ensemble
within a sphere of radius $R$ about the origin?

(c) Same question as in part (b) for the point being between two spheres of respective
radius $R$ and $R + dR$.

9-6. Verify if a bandpass function restricted to the frequency interval \w, (X \)w 
can be expanded as 



sin 2 irw(\ H- 1) — sin 2 Trw\(t — n / 2 w) 
2 Trwii — 71 /2w) 


\ 


9-7. Verify the validity of the following identity due to C 11, Calm. If f{t) is a 
periodic function of period $T$ and if all the Fourier coefficients vanish above the nth
harmonic, then


m 


c = n 


sin {{ 2 N + l){ 7 r/T)[t ~ kT/( 2 N + 1)1 1 
l2N + 1) sin {{n/m -■A:7V(2A“+ J")] | 


(S. Goldman, "Information Theory," p. 83, Prentice-Hall, Inc., Englewood Cliffs,
N.J., 1953.)

9-8. Show that the three vectors 


HI 


^1 [H, -H, H] 


form an orthonormal basis for $R^3$.

9-9. Show the validity of 

$\|X + Y\| \le \|X\| + \|Y\|$

in by a simple study of an associated triangle. 

9-10. Prove that, if 

$\|X + Y\| = \|X\| + \|Y\|$

then $X$ and $Y$ are linearly dependent.

9-11. Let $f(t)$ be a function of period $2\pi$ such that

$f(t) = 1$    $0 < t < \pi$
$f(t) = -1$    $-\pi < t < 0$
$f(t) = 0$    $t = 0$ or $t = \pi$

Show that

$f(t) = \frac{4}{\pi} \sum_{n=1}^{\infty} \frac{\sin (2n - 1)t}{2n - 1}$





9-12. Show that the functions

$\frac{1}{\sqrt{\pi}}$    $\sqrt{\frac{2}{\pi}} \cos t$    $\sqrt{\frac{2}{\pi}} \cos 2t$

form an orthonormal set in $L^2$ with respect to the interval $[0,\pi]$. Verify the distance
inequality.

9-13. Prove that if $A$, $B$, and $X$ are voltages in a two- or three-dimensional signal
space (using, for example, the space corresponding to the sampling theorem), and

$|A - X| = |B - X|$

then the signals $B - A$ and $X - \frac{A + B}{2}$ are orthogonal.

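9-14. Show that the surface of an n-dimensional sphere of radius $R$ is

$\frac{n \pi^{n/2} R^{n-1}}{\Gamma(n/2 + 1)}$

9-15. Using the formula of the preceding problem, with the notation of Sec. 9-15,
show that

$\frac{\Omega(\theta_1)}{\Omega(\pi)} \approx \frac{\Gamma(n/2 + 1) (\sin \theta_1)^{n-1}}{n\, \Gamma[(n + 1)/2]\, \pi^{1/2} \cos \theta_1}$

where $\theta_1$ is the cone angle such that the solid angle of the n-dimensional cone is

$\Omega(\theta_1) = \frac{(n - 1) \pi^{(n-1)/2}}{\Gamma[(n + 1)/2]} \int_0^{\theta_1} (\sin \theta)^{n-2}\, d\theta$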


9-16. Let $[Y_{-N}, \ldots, Y_N]$ be a random variable corresponding to
the sampling intervals of a band-limited signal. Assume all $Y_k$ mutually independent
variables with normal distributions and equal standard deviation $\sigma = 1$. Show that
the probability density of the square of the distance, that is,

$X = \sum_{k=-N}^{N} Y_k^2$

depends only on the sum of the squares of the averages of the $Y_k$'s, but not explicitly on
any individual variables.

$\lambda = \sum_{k=-N}^{N} \overline{Y_k}^2$

Can you derive the probability density function $P[X \mid \lambda]$? (See Rice, particularly
Sec. 4 and Appendix 1.)

9-17. Let $X_1, X_2, \ldots, X_n$ be mutually independent random variables with a
common probability density function $f(x)$, $g(x)$ a real-valued function, and $S$ the set
of all points in the n-space such that

$\sum_{k=1}^{n} g(x_k) \ge nd$    $d$ is a fixed number

Then prove that for any t > 0 

j ‘ j ‘ ' ■ ^ 

s 

This inequality is used by Thomasian.* Its proof follows directly. (See also Loeve,
p. 158.)

9-18. In the preceding problem, let all variables have standard normal distributions
and fi > 1. Consider g{x) = and show that 



PART 3 


SCHEMES WITH MEMORY 


The bringing together of theory and practice leads to the most favorable
results; not only does practice benefit, but sciences themselves develop
under the influence of practice, which reveals new subjects for investigation
and new aspects of familiar subjects.

P. L. Chebyshev


Quoted by Khinchin in Uspekhi Mat. Nauk, vol. 8, no. 8, p. 3, 1953.




CHAPTER 10 


STOCHASTIC PROCESSES 


The mathematical theory of probability has grown tremendously
during the past three decades. Today probability encompasses several
professional fields of mathematical endeavor, such as game theory,
decision theory, time series, and Markov chains.

Since 1940 several mathematicians have made fundamental contributions
to the establishment of the new science of statistical theory of communications.
Perhaps two of the most significant landmarks of communication
theory which have immediate bearing on this subject are the
Wiener-Kolmogorov theory of filtering and prediction and Shannon's
information theory. In both cases, stochastic processes occupy an
important place. In communication theories, messages that are transmitted
in time intervals are generally dealt with; that is, the raw material
consists of time series. This immediately gives rise to questions on the
statistical nature of these time series and their accurate description, both
at the entry and at the exit of physical systems.

The theory of filtering and prediction has been primarily concerned 
with problems of determining optimum linear systems in the sense of the 
least-square criterion, for extraction of signals from particular mixtures 
of signals and noise. This is a major problem in communication theory
with frequent practical application. However, it seems that this specific 
topic has been somewhat overemphasized. At present a broader out- 
look, the general study of linear systems under stochastic regime, appears 
desirable. Such a study seems to be the most natural extension of the 
ordinary network theory which is concerned with the study of linear 
systems under deterministic time functions. Today this well-developed 
area of deterministic linear systems occupies an important position in the 
technological development of our applied sciences. Therefore, further 
knowledge of stochastic processes is essential to physical scientists 
interested in extending the study of deterministic networks to probabilis- 
tic systems. 

Information theory deals with messages and their ensemble trans- 
formations. There again a fundamental study of the problems involved 
requires a knowledge of stochastic processes. 



The present chapter constitutes an introductory survey of the theory 
of stochastic processes. The application of such processes to linear 
systems is the general theme of the statistical theory of communications. 

This chapter is aimed at giving the communications engineer a short 
systematic treatment of the subject without too great a sacrifice of 
mathematical rigor. Those professionally interested in this field will 
find a large number of recent books and articles available for a more 
specialized coverage of the subject (see Middleton and Davenport and 
Root). 

10-1. Stochastic Theory. In a first attempt to acquire some knowledge
of probability theory, the subject must be confined to what may be called
the static part of probability theory. This chapter will present an
exposé of the dynamic or, more appropriately, stochastic part of the
theory. It will study probabilistic phenomena which depend on time
or any other real parameter. In the mathematical literature this part is
referred to as the study of time series, or, more technically, a study of
stochastic processes. The immediate objective is to acquaint the
reader with a method of analysis of time series. This will require the
introduction of some new terms and a reappraisal of the more elementary
concepts of probability theory. Subsequent to defining time series,
there will be a study of averages and expectations. Finally, we relate the
study of the averages to the well-established theory of Fourier integrals,
in much the same line of thought as relating the concept of moments to
characteristic functions through Fourier integrals.

The following intuitive definition of a stochastic process can be given.*
A time-dependent stochastic process is a random process whose outcomes
are infinite sequences or functions, in contrast to a simple random
variable whose outcomes are numbers or vectors. In other words,
a stochastic process $\{X(t)\}$ is a probability process whose outcomes are
functions of $t$. The values of the process at times $t_1, \ldots, t_n$, that is,
$X(t_1), X(t_2), \ldots, X(t_n)$, form a sequence of random variables.

At each instant of time $t_i$ there is a random variable $X(t_i)$ with a
specified probability distribution. The stochastic process can have a
discrete or continuous time parameter. A discrete process consists of a
finite or an infinite sequence of random variables each defined at a discrete
time. Without loss of generality, assume the time sequence to be
integers $\ldots, -2, -1, 0, 1, 2, 3, \ldots$. Then the process will be
designated by

$X(t) = \{\ldots, X_{-2}, X_{-1}, X_0, X_1, X_2, X_3, \ldots\}$

For a continuous stochastic process, the outcome at any desired instant
$t_i$, that is, $X(t_i)$, is a random variable. Examples of a stochastic process in
discrete and continuous time are sketched in Fig. 10-1.

* See J. L. Doob, Am. Math. Monthly, vol. 49, no. 10, pp. 649-653, 1942.

In order completely to specify a random process, the joint probability
distribution of any number of random variables of the process must be
known. To be specific, for any integer $n$ and any set of real numbers
$t_1, t_2, t_3, \ldots, t_n$ belonging to the time interval of the process there must
be a set of random variables $X(t_1), X(t_2), X(t_3), \ldots, X(t_n)$ with a
known joint probability distribution*

$P\{X_{t_1} < x_1, X_{t_2} < x_2, \ldots, X_{t_n} < x_n\}$

Fig. 10-1. (a) Example of a discrete-time stochastic process; (b) example of a
continuous-time stochastic process.

As an example of a discrete process, consider the repeated throws of a
biased coin, which are head or tail, with respective probabilities $p$ and
$1 - p$. Call the result of the $k$th throw a random variable $X_k$, which
assumes one of the two numerical values, say 1 corresponding to a head
and 0 corresponding to a tail. Here the family of the random variables
$X$ consists of

$X(t) = \ldots, X_{-2}, X_{-1}, X_0, X_1, X_2, X_3, \ldots$

The continuation of the throws in both numerical directions is a matter of
mathematical convenience rather than anything else. In this example,
note that any two members of the family are independent of each other.
In general, these random variables need not be independent. Each
random variable of the family has a well-defined probability distribution,
and all the joint probabilities for two or more members of the family are
completely defined by the binomial law.
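The coin-throwing process is easy to simulate. The sketch below generates a finite stretch of one outcome of the process and evaluates a joint probability as a product, which is legitimate here only because the throws are independent; the function names and the one-sided indexing (throws numbered from 0) are assumptions of this example.

```python
import random

def coin_process(p, length, seed=0):
    """One outcome of the process: a sequence of 1 (head, probability p)
    and 0 (tail, probability 1 - p)."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(length)]

def joint_probability(values, p):
    """P{X_k1 = v1, ..., X_km = vm} for m distinct throws.
    Independence makes this a product of single-throw probabilities."""
    prob = 1.0
    for v in values:
        prob *= p if v == 1 else (1.0 - p)
    return prob
```

For instance, with $p = 0.3$ the joint probability of head, head, tail on three distinct throws is $0.3 \times 0.3 \times 0.7 = 0.063$, whichever three throws are chosen.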

* It is to be noted that, while for the continuous processes the selection of the time
sequence $\ldots, t_{-1}, t_0, t_1, t_2, \ldots$ is arbitrary, for a discrete process the time sequence
must be selected appropriately from the doubly infinite sequence of the discrete times
of the process. In the latter case, the number of variables is, at most, denumerable.





For a simple illustration of a continuous process, consider a large num- 
ber of radio receivers registering a stochastic noise voltage, as sketched 
in Fig. 10-2. The voltage registered by each receiver is an outcome of the 
stochastic process, or a member of the ensemble function. 

Select an infinite number of time-sampling points on the $t$ axis about
some time reference $t = 0$, such as $t_{-2}, t_{-1}, t_0, t_1, t_2, \ldots$. At a sampling
time $t_i$ the values of the registered noise for different receivers will be
designated by

$A_1(t_i), A_2(t_i), \ldots, A_k(t_i), \ldots$    $i = \ldots, -2, -1, 0, 1, 2, \ldots$

Fig. 10-2. An illustration of different outcomes of a process (first receiver,
second receiver, ..., $k$th receiver).

Assuming that a great many receivers are available, it should be possible,
at least theoretically, to estimate the probability distribution of the
noise amplitude $A(t_i) = X(t_i)$:

$P\{x < X(t_i) < x + dx\} = f(x; t_i)\, dx$    (10-1)

the probability that at time $t_i$ the registered noise voltage lies between $x$
and $x + dx$ volts. This is called the first probability density distribution





of the process. To be more exact, the first probability density distribution
is

$f(x; t)\, dx$

defined for all sampling times. Similarly, the second probability density
distribution is defined as

$f(x_1, x_2; t_1, t_2)\, dx_1\, dx_2$

for any values of $t_1$ and $t_2$ [the joint probability of the noise voltage remaining
within a given interval $(x_1, x_1 + dx_1)$ at sampling time $t_1$, and within
the interval $(x_2, x_2 + dx_2)$ at another specified sampling time $t_2$]. This
function should be defined for any finite pair of time points $(t_k, t_j)$.
Finally, the nth-order probability density function of the process can be
defined as

$f(x_a, \ldots, x_{-2}, x_{-1}, x_0, x_1, x_2, x_3, \ldots, x_b; t_a, \ldots, t_{-2}, t_{-1}, t_0, t_1, \ldots, t_b)$    (10-2)

For example,

$f(x_a, \ldots, x_{-2}, x_{-1}, x_0, x_1, \ldots, x_b; t_a, \ldots, t_b)\, dx_a \cdots dx_b$

means the probability of finding simultaneously the values of the following
random variables with the specified times in the range specified below:

$x_{-2} < X(t_{-2}) < x_{-2} + dx_{-2}$
$x_{-1} < X(t_{-1}) < x_{-1} + dx_{-1}$
$x_0 < X(t_0) < x_0 + dx_0$    (10-3)
$x_1 < X(t_1) < x_1 + dx_1$
$x_2 < X(t_2) < x_2 + dx_2$

The order $n$ implies the consideration of the joint distribution of $n$
random variables at specified sampling points. When the joint probability
distributions are known for any selected finite $k$ points of the interval
of the process and for $k = 1, 2, \ldots, n$, then we consider that the
process is known up to order $n$. Evidently the given data must be
collectively consistent. The following section presents examples of
stochastic processes.

10-2. Examples of a Stochastic Process. Consider the ensemble of 
the time functions 


$X = \sin (t + \theta)$


(10-4) 





where $t$ is the time and $\theta$ a random variable with given distribution density
$p(\theta)$. For each value of $t$, say $t_0, t_1, t_2, \ldots$, the function $X$ assumes
values associated with different random variables:

$t_{-1}$    $X(t_{-1})$    $\sin (t_{-1} + \theta)$
$t_0$    $X(t_0)$    $\sin (t_0 + \theta)$
$t_1$    $X(t_1)$    $\sin (t_1 + \theta)$
$t_2$    $X(t_2)$    $\sin (t_2 + \theta)$

The probability density function, say at time $t = t_0$, can be computed
from the knowledge of $p(\theta)$ and according to the rules of transformation
of a random variable (see Sec. 5-11). Similarly, to compute the joint
probability density,

$f(x_1, x_{-2}; t_1, t_{-2})\, dx_1\, dx_{-2} = P\{x_1 < \sin (t_1 + \theta) < x_1 + dx_1,\ x_{-2} < \sin (t_{-2} + \theta) < x_{-2} + dx_{-2}\}$

Note that, in this example, $X_1, X_2, \ldots$ are actually functionally dependent,
which is stronger than statistical dependence. For processes whose
sample values at any $t_k$ and $t_j$ are independent, the desired joint probability
density distribution is the product of the two individual first-order
densities. As in the above, these sampled variables need not be independent
in general, but the nature of their interdependence must be
specified.
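The first-order distribution of this process can be checked by simulation for one particular choice of $p(\theta)$. The sketch below assumes that $\theta$ is uniformly distributed over $[0, 2\pi)$, which is an assumption of this example and not part of the text; in that case $P\{X(t) \le x\} = 1/2 + (\arcsin x)/\pi$ for every $t$. The function names are hypothetical.

```python
import math
import random

def sample_values(theta, times):
    """Values of one member function X(t) = sin(t + theta) at the given times.
    A single theta fixes all of them: the functional dependence noted above."""
    return [math.sin(t + theta) for t in times]

def first_order_cdf(x):
    """P{X(t) <= x} when theta is uniform on [0, 2*pi); it does not depend on t."""
    return 0.5 + math.asin(x) / math.pi

def empirical_cdf(x, t, members=20000, seed=0):
    """Fraction of ensemble members whose value at time t does not exceed x."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(members)
        if math.sin(t + rng.uniform(0.0, 2.0 * math.pi)) <= x
    )
    return hits / members
```

The empirical fraction agrees with the closed form to within sampling error, and the same closed form is obtained at any other sampling time.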

As a second example, consider the process

$X = A + Bt$    (10-6)

where $A$ and $B$ are independent random variables with normal distributions
(zero mean and $\sigma$ standard deviation). Then,

$t_{-1}$    $X(t_{-1})$    $A + Bt_{-1}$
$t_0$    $X(t_0)$    $A + Bt_0$    (10-7)
$t_1$    $X(t_1)$    $A + Bt_1$

As $t_k$ is a constant number for each sampled variable, $X$ is the linear combination
of two independent variables, each with a given normal distribution.
Thus, the first-order density for the ensemble is well-defined.
For example, at $t = 2$,

$X_2 = X(t_2) = A + 2B$



The random variable $X_2$, consisting of the linear combination of
two normally distributed random variables, will itself have a normal
distribution:

$$f(x_2) = \frac{1}{\sigma_0 \sqrt{2\pi}} \exp\left(-\frac{x_2^2}{2\sigma_0^2}\right)$$

with

$$\sigma_0 = \sqrt{1 + 2^2}\,\sigma = \sqrt{5}\,\sigma$$


Similarly, the joint density for any two or more sampled variables can be 
computed without any basic difficulties. 
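
The standard deviation of a sampled variable of the process X = A + Bt can
be checked numerically. The following is a minimal sketch, not part of the
original text: the function names, the seed, the value of sigma, and the
sample size are all illustrative assumptions. It draws many outcomes of
X(2) = A + 2B and compares the empirical standard deviation with
sqrt(1 + t^2) * sigma.

```python
import math
import random
import statistics

def sample_x(t, sigma, rng):
    """Draw one value of X(t) = A + B*t with A, B independent N(0, sigma^2)."""
    a = rng.gauss(0.0, sigma)
    b = rng.gauss(0.0, sigma)
    return a + b * t

def theoretical_std(t, sigma):
    """Standard deviation of A + B*t for independent A, B: sqrt(1 + t^2) * sigma."""
    return math.sqrt(1.0 + t * t) * sigma

rng = random.Random(0)   # fixed seed, chosen arbitrarily for repeatability
sigma = 1.5              # illustrative value, not taken from the text
samples = [sample_x(2.0, sigma, rng) for _ in range(100_000)]
empirical_std = statistics.pstdev(samples)

# The two numbers should agree to within sampling error.
print(empirical_std, theoretical_std(2.0, sigma))
```

With sigma = 1 the theoretical value reduces to the square root of 5, in
agreement with the expression for $\sigma_0$ given above.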

10-3. Moments and Expectations. This section develops some averaging
considerations for stochastic processes. This is similar to the concept
of mean values for ordinary functions, or the concept of moments and
expectation of random variables.

There are two kinds of averaging involved in stochastic processes. The 
first kind deals with moments of the nth-order density distribution of the 
stochastic process; in the second kind of averaging the different averages 
associated with one or more time functions which are members of the 
process are taken. 

Ensemble Averages. The reader is already familiar with the moments
of different orders associated with a random variable. In the case of a
stochastic process, the same idea applies. For example, the kth-order
moment of the first-order density distribution of the process is defined as

$$m_k(t) = \int_{-\infty}^{\infty} x^k f(x;t)\,dx \tag{10-8}$$

The first- and the second-order moments for the random variable
$X(t_1)$ are

$$\overline{X(t_1)} = m_1(t_1) = \int_{-\infty}^{\infty} x f(x;t_1)\,dx \tag{10-9}$$

$$\overline{[X(t_1)]^2} = m_2(t_1) = \int_{-\infty}^{\infty} x^2 f(x;t_1)\,dx \tag{10-10}$$

The familiar concept of the central moments and standard deviation,
as discussed in Chap. 6, may also be employed:

$$\sigma^2(t_1) = \int_{-\infty}^{\infty} [x - m_1(t_1)]^2 f(x;t_1)\,dx \tag{10-11}$$

Next, it is natural to study the different moments associated with the
second-order joint distribution of the process; for instance,

$$\overline{X(t_k) \cdot X(t_j)} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x_k x_j f(x_k,x_j;t_k,t_j)\,dx_k\,dx_j \tag{10-12}$$

The idea of ensemble averaging can be extended in an obvious manner to 
higher moments and higher-order joint probability densities of the 
process. 
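
As an illustration of ensemble averaging, the moments of the process
X = sin (t + theta) of Sec. 10-2 can be evaluated numerically. The sketch
below assumes that theta is uniformly distributed over an interval of length
2*pi (one particular choice of the density, made here only for illustration)
and approximates the expectation over theta by a midpoint rule. The function
names are hypothetical.

```python
import math

def ensemble_average(g, n=2_000):
    """Average g(theta) over theta uniform on [0, 2*pi) by the midpoint rule."""
    h = 2.0 * math.pi / n
    return sum(g((i + 0.5) * h) for i in range(n)) / n

def first_moment(t):
    """First-order moment of X(t) = sin(t + theta)."""
    return ensemble_average(lambda th: math.sin(t + th))

def second_moment(t):
    """Second-order moment of X(t) = sin(t + theta)."""
    return ensemble_average(lambda th: math.sin(t + th) ** 2)

def joint_moment(t1, t2):
    """Second-order joint moment of X(t1) and X(t2)."""
    return ensemble_average(lambda th: math.sin(t1 + th) * math.sin(t2 + th))

print(first_moment(0.7), second_moment(0.7), joint_moment(0.3, 1.1))
```

Under this assumption the first moment vanishes, the second moment is 1/2
for every t, and the joint moment equals (1/2) cos (t1 - t2), so that it
depends on the two sampling instants only through their difference.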





Time Averages. The concept of time averaging deals with the different
averages computed from one of the time function members of the
ensemble if such averages exist. Let {X(t)} be a process depending on
several random variables $A_1, A_2, \ldots, A_k$ and time t. For a given set
of values of $A_1, A_2, \ldots, A_k$, we have what is referred to as an outcome.
For example, to concentrate on one particular outcome of the process
{X(t)}, an average value for this outcome can be defined as

$$\langle X \rangle = \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{T} X(t)\,dt \tag{10-13}$$

The definition is contingent on the existence of a finite value for the limit.
For some outcomes of X(t), $\langle X \rangle$ may not exist.

Similarly, introduce a second-order averaging for each member of the
ensemble as

$$\langle X(t) \cdot X(t + \tau) \rangle = \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{T} X(t) \cdot X(t + \tau)\,dt \tag{10-14}$$

This average gives a measure for the interaction or coherence between 
the values of the time function under consideration at a time t and time
$t + \tau$, where $\tau$ is any real time interval. This type of averaging is very
common in engineering problems. It may be added that time averaging 
is associated with a known member of the ensemble. The procedure does 
not exploit the knowledge of the probability distributions of the totality 
of the ensemble. It simply indicates the averages pertinent to a particu- 
lar member. These averages may or may not be identical for all mem- 
bers of the ensemble. 
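
The two time averages can be approximated for a single known member of an
ensemble by replacing the limit with a large but finite T. The following
sketch is an illustration only: the member sin (t + 0.4), the value
T = 500, and the helper names are assumptions, and the integral is
approximated by a midpoint rule.

```python
import math

def time_average(x, T, n=100_000):
    """Approximate (1/2T) * integral of x(t) over [-T, T] by the midpoint rule."""
    h = 2.0 * T / n
    return sum(x(-T + (i + 0.5) * h) for i in range(n)) * h / (2.0 * T)

def lagged_time_average(x, tau, T, n=100_000):
    """Approximate the second-order time average <x(t) x(t + tau)>."""
    return time_average(lambda t: x(t) * x(t + tau), T, n)

theta0 = 0.4                                  # one fixed outcome of theta (illustrative)
member = lambda t: math.sin(t + theta0)       # a single member of the ensemble
T = 500.0                                     # finite stand-in for the limit T -> infinity

print(time_average(member, T))                # close to 0
print(lagged_time_average(member, 1.0, T))    # close to 0.5 * cos(1.0)
```

Only the one time function enters the computation; no probability
distribution of the ensemble is used, which is the point made in the
paragraph above.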

10-4. Stationary Processes. In nontechnical language, a stochastic 
process is said to be stationary when the process is temporally homo- 
geneous; i.e., its statistical properties remain invariant under every 
translation of the time scale. In technical language it is implied by a 
stationary process that 

$$\begin{aligned}
P\{X_1 \le a_1, X_2 \le a_2, X_3 \le a_3, \ldots, X_n \le a_n\}
&= F(a_1,a_2,a_3,\ldots,a_n;\,t_1,t_2,t_3,\ldots,t_n) \\
&= F(a_1,a_2,a_3,\ldots,a_n;\,t_1+T,t_2+T,\ldots,t_n+T)
\end{aligned} \tag{10-15}$$

T being any real number, the identity must hold for every finite T and all
appropriate choices of $t_1, t_2, \ldots, t_n$ and $X_1, X_2, \ldots, X_n$. (If this
relationship holds for every finite integer n = 1, 2, . . . the process
is strictly stationary; otherwise the process is stationary of order k, k
being the highest integer for which the above relationship holds.) For
many practical problems one is often confined to the cases where station- 
arity holds in the first- and second-order distributions. An equivalent 





interpretation of the stationarity is the fact that all joint probability
densities for $X_k X_j$ depend on the time difference (k - j) but not on the
absolute value of the time. Any outcome of the strictly stationary
process, that is, any member of such ensemble, if shifted in time, will
lead to another outcome of the same ensemble. The stationarity of the
first order implies

$$E[X(t_k)] = E[X(t_{k+m})] \quad \text{for every } k \text{ and every } m \tag{10-16}$$

The biased coin of Sec. 10-1 gives an example of a stationary stochastic 
process as successive trials are assumed to be independent. In the 
example of Sec. 10-2, $X = \sin (t + \theta)$, the density distribution for
$X(t_k)$ and $X(t_j)$ is identical for the particular case when $\theta$ is
uniformly distributed in an interval of length $2\pi$.

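The second-order stationarity implies that for any set of $(t_k, t_j)$

$$\begin{aligned}
E\{[X(t_k) - \overline{X(t_k)}]&[X(t_k + \tau) - \overline{X(t_k + \tau)}]\} \\
&= E\{[X(t_j) - \overline{X(t_j)}][X(t_j + \tau) - \overline{X(t_j + \tau)}]\} = \rho(\tau)
\end{aligned} \tag{10-17}$$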

This is an immediate consequence of the invariance of the second-order 
joint probability distribution under any shift of the origin of time. 
However, the converse of this statement is not necessarily true. A proc- 
ess may obey the latter equation without being stationary of the second 
order. Because of the importance of the concept of stationarity of 
the second order, one must supplement the above definition, which is 
based on the invariance of the joint probability distribution, with a less 
restrictive definition, that is, the invariance of the second-order expecta- 
tion. The following definition is generally accepted in technical litera- 
ture: A stochastic process is said to be stationary of the second order 
(or stationary in the wide sense) when the following two conditions are
satisfied:

(I) $E[Y(t)]^2 < \infty$ for all $t \in T$

(II) $E[Y(t_1)Y(t_1 + \tau)] = \rho(\tau)$ = function of $\tau$ only, for all $t_1 \in T$


The advantage of this less restrictive definition is the fact that it generally 
provides a simpler means of determining the statistical character of a 
process. The second-order moment described in condition II is basically 
the same as the second-order central moment described earlier. In 
fact, consider the transformation of the variable 

$$Y(t) = \frac{X(t) - \overline{X(t)}}{\sqrt{E[X(t) - \overline{X(t)}]^2}} \tag{10-19}$$

and note that

$$\overline{Y(t)} = 0$$





Thus the condition previously described by the second-order central
moment of the process {X(t)} will lead to condition II when the first-
order moments are selected to be zero for all values of time.

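Example 10-1. Consider the stochastic process

$$\{X(t)\} = At + Bt^2$$

where A and B are independent random variables with zero means and equal standard
deviations $\sigma$. Compute the autocovariance function of the process.

Solution. The random variable $X(t_1)$ is a linear combination of two independent
random variables:

$$X(t_1) = At_1 + Bt_1^2$$

$$E[X(t_1)] = t_1 E(A) + t_1^2 E(B) = 0$$

The second-order expectation is

$$\begin{aligned}
E[X(t_j) \cdot X(t_k)] &= E[(At_j + Bt_j^2)(At_k + Bt_k^2)] \\
&= E[A^2 t_j t_k + AB\,t_j t_k^2 + BA\,t_j^2 t_k + B^2 t_j^2 t_k^2] \\
&= t_j t_k E(A^2) + (t_j + t_k)\,t_j t_k E(AB) + t_j^2 t_k^2 E(B^2) \\
&= t_j t_k \sigma^2 + t_j^2 t_k^2 \sigma^2
\end{aligned}$$

The second-order central moments are

$$E[X(t_j) - \overline{X(t_j)}]^2 = E[X^2(t_j)] = (t_j^2 + t_j^4)\sigma^2$$

$$E[X(t_k) - \overline{X(t_k)}]^2 = E[X^2(t_k)] = (t_k^2 + t_k^4)\sigma^2$$

The autocorrelation becomes

$$\rho = \frac{E\{[X(t_j) - \overline{X(t_j)}][X(t_k) - \overline{X(t_k)}]\}}{\sqrt{E[X(t_j) - \overline{X(t_j)}]^2 \cdot E[X(t_k) - \overline{X(t_k)}]^2}}
= \frac{(t_j t_k + t_j^2 t_k^2)\sigma^2}{\sigma^2\sqrt{(t_j^2 + t_j^4)(t_k^2 + t_k^4)}}
= \frac{1 + t_j t_k}{\sqrt{(1 + t_j^2)(1 + t_k^2)}} \qquad (t_j, t_k > 0)$$

The process is evidently not stationary, as the autocovariance function, that is, the
numerator, depends on $t_j$ and $t_k$, not merely the difference $t_j - t_k$.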
Example 10-2. A stochastic process is described by

$$\{X(t)\} = A \sin t + B \cos t$$

where A and B are independent random variables with zero means and equal standard
deviations $\sigma$. Show that the process is stationary of the second order.

Solution. The first and the second moments can be directly computed:

$$\begin{aligned}
E[X(t)] &= E(A \sin t + B \cos t) \\
&= \sin t\,E(A) + \cos t\,E(B) = 0
\end{aligned}$$

$$\begin{aligned}
E[X(t_1)X(t_2)] &= E[(A \sin t_1 + B \cos t_1)(A \sin t_2 + B \cos t_2)] \\
&= E[A^2 \sin t_1 \sin t_2 + AB \sin (t_1 + t_2) + B^2 \cos t_1 \cos t_2] \\
&= \sin t_1 \sin t_2\,E(A^2) + \sin (t_1 + t_2)\,E(AB) + \cos t_1 \cos t_2\,E(B^2) \\
&= \sigma^2 \cos (t_1 - t_2)
\end{aligned}$$

The second-order central moment depends only on $t_1 - t_2$; therefore the process is
stationary. (In this example it is easy to show that the second-order probability
distributions also depend only on the time difference $t_1 - t_2$. Thus, even according
to the more restricted definition of the term, the process is stationary of the second
order.)
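
The contrast between Examples 10-1 and 10-2 can be made concrete by testing
whether each second-order expectation is unchanged when both sampling
instants are shifted by the same amount. The sketch below is illustrative
only: it encodes the two closed-form expectations derived above (with
E(A^2) = E(B^2) = sigma^2 and E(AB) = 0), and the function names, sample
pairs, and shifts are hypothetical choices.

```python
import math

def cov_example_10_1(t1, t2, sigma=1.0):
    """E[X(t1) X(t2)] for X(t) = A*t + B*t**2."""
    return sigma**2 * (t1 * t2 + (t1 * t2) ** 2)

def cov_example_10_2(t1, t2, sigma=1.0):
    """E[X(t1) X(t2)] for X(t) = A*sin(t) + B*cos(t)."""
    return sigma**2 * (math.sin(t1) * math.sin(t2) + math.cos(t1) * math.cos(t2))

def is_shift_invariant(cov, pairs, shifts, tol=1e-9):
    """True if cov(t1 + s, t2 + s) equals cov(t1, t2) for every listed pair and shift."""
    return all(abs(cov(t1 + s, t2 + s) - cov(t1, t2)) <= tol
               for (t1, t2) in pairs for s in shifts)

pairs = [(0.0, 1.0), (0.5, 2.0), (-1.0, 3.0)]
shifts = [0.25, 1.0, 7.5]

print(is_shift_invariant(cov_example_10_1, pairs, shifts))  # False
print(is_shift_invariant(cov_example_10_2, pairs, shifts))  # True
```

A finite set of test points can only refute shift invariance, not prove it;
for Example 10-2 the proof is the identity that reduces the expectation to
a function of $t_1 - t_2$ alone.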




10-5. Ergodic Processes.* In order to define an ergodic stochastic
process X(t), a new associated random variable will first be defined:

$$X_e = \lim_{r \to \infty} \frac{X(t_k) + X(t_{k+1}) + \cdots + X(t_{k+r})}{r + 1} \quad \text{for a discrete process} \tag{10-20}$$

and

$$X_e = \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{T} X(t)\,dt \quad \text{for a continuous process}$$

This random variable is somewhat indicative of the average values of all
the occurring random variables of the family. For stationary processes
it is generally true that this limit exists. Furthermore, assume that,
for a large value of r, the above random variable does not further indicate
any randomness but approaches a constant number. For an ergodic
process these requirements must be met. For every outcome of the
ergodic process the time average $X_e$ should exist and should equal the
expected value of any specific sampled random variable of the sequence:

$$E[X(t_k)] = X_e \quad \text{for every } k$$

According to the symbols adopted in the previous section, it may be 
stated that the first-order ergodic property of a stationary process implies 

$$\overline{X(t_k)} = \langle X(t) \rangle \quad \text{for every } k \tag{10-21}$$

Similarly, the ergodicity of the second order implies

$$\overline{X(t_k) \cdot X(t_k + \tau)} = \langle X(t) \cdot X(t + \tau) \rangle \tag{10-22}$$

From the mathematical point of view, the above “definition” for 
ergodicity remains somewhat incomplete. To be more exact, one should 
qualify the equality of the ensemble average and time average for the 
stationary process with the reservation “almost everywhere.” The
latter terminology has a specific mathematical meaning which will not be
discussed here. The reader interested in the practical application of the 
theory may rest assured that he is not faced with such unusual circum- 
stances in the study of ordinary physical systems. 

Another mathematical point to be brought out here is the fact that, 
while the most important implication of ergodicity has been stated, there 
is a tacit omission of the delicate mathematical definition for such a 
process. The readers interested in a more precise definition may find 
the following presentation more satisfactory. Let 

$$\{X\} = (\ldots, X_{-1}, X_0, X_1, X_2, \ldots)$$

* The material of this section was communicated in essence by Dr. L. Cote. The 
section may be omitted in a first reading. All equations referring to ergodicity imply 
“almost everywhere.”





be a random sequence of a discrete stationary process, and $g(y_1, y_2, \ldots, y_r)$
be any real single-valued function of r variables. Now define a function
of r sampled random variables of the sequence

$$G(N,X) = \frac{1}{N + 1} \sum_{k=0}^{N} g(X_{k+1}, X_{k+2}, \ldots, X_{k+r})$$

Let

$$\lim_{N \to \infty} G(N,X) = G(X)$$

The function G(X) in the limit may or may not exist. If it exists and has
a constant value G independent of X, and moreover if this constant value
of G is equal to $E[g(X_1, X_2, \ldots, X_r)]$ for all choices of sequence $X_k$,
$X_{k+1}, \ldots, X_{k+r}$, then the sequence is called ergodic.

It is easy to see that the above-mentioned two cases are encompassed
by this more general definition. In fact, letting

$$g(y_1, y_2, \ldots, y_r) = g(y) = y \qquad r = 1$$

there results

$$G(N,X) = \frac{1}{N + 1} \sum_{n=0}^{N} X_n$$

$$\lim_{N \to \infty} G(N,X) = E(X)$$

The process lim used here differs from the usual limit since G is a function
of a random variable. (For a discussion of the concept, see Sec. 10-12.)
Similarly, the second-order ergodicity can be illustrated by letting

$$g(y_1, y_2, \ldots, y_r) = g(y_1, y_2) = [y_1 - E(X)][y_2 - E(X)]$$

$$G(N,X) = \frac{1}{N + 1} \sum_{n=0}^{N} [X_n - E(X)][X_{n+K} - E(X)]$$

Since the process is stationary to begin with, it follows that

$$\lim_{N \to \infty} G(N,X) = \rho(K)$$

$\rho(K)$ does not depend on $X_n$ but depends only on the time interval K.
The foregoing defining procedure employs the concept of a discrete ergodic 
process, for convenience. The same conceptual pattern applies for defin- 
ing continuous ergodic processes. 

During the past three decades a vast amount of mathematical work 
has been produced on the subject of ergodicity. Birkhoff's famous
ergodic theorem is a classical landmark in this specialized field. The
interested reader is referred to specialized articles on the subject. 





The diagram of Fig. 10-3 is a reminder that the ergodic processes as 
defined here are a subclass of the stationary processes which are in turn 
included in the more general class of stochastic processes. This intro- 
ductory presentation is not concerned with nonstationary processes. 



Fig. 10-3. A classification of different processes as discussed in the text.

In fact, the subsequent study of linear systems will be confined principally
to ergodic processes.

Example 10-3. Determine whether or not the stochastic process

$$\{X(t)\} = A \sin t + B \cos t$$

is ergodic. A and B are normally distributed independent random variables with
zero means and equal standard deviation $\sigma$.

Solution. The second-order central moment of the process was computed in Exam-
ple 10-2:

$$E[X(t_1)X(t_2)] = \sigma^2 \cos (t_1 - t_2)$$

In order to obtain the time average, consider two specific members of the ensemble:

$$A_1 \sin t + B_1 \cos t \qquad \text{and} \qquad A_2 \sin t + B_2 \cos t$$

where $A_1$, $A_2$ and $B_1$, $B_2$ are some specific permissible values that the variables A and B
may assume. Next compute the second-order time average.

$$\text{Time average} = \langle (A_1 \sin t + B_1 \cos t)[A_2 \sin (t + \tau) + B_2 \cos (t + \tau)] \rangle$$

It can be shown that this time average depends on the selection of the member in the
ensemble. In fact, let $\tau = 0$, and inspection will show that

$$\text{Time average} = \langle A_1 A_2 \sin^2 t \rangle + \langle B_1 B_2 \cos^2 t \rangle = \tfrac{1}{2}(A_1 A_2 + B_1 B_2)$$

It has been found that there is at least one second-order time average that depends on
the selected members of the ensemble, rather than being a constant. Thus the process
is not ergodic although it is stationary.
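
The dependence of the time average on the selected member can be seen
numerically. The following sketch approximates the time average of
Example 10-3 over a long but finite interval; the function name, the
coefficient values, and the choice T = 500 are assumptions made for
illustration and do not appear in the text.

```python
import math

def time_average_product(a1, b1, a2, b2, tau, T=500.0, n=100_000):
    """Approximate <(a1 sin t + b1 cos t)(a2 sin(t+tau) + b2 cos(t+tau))>.

    The limit T -> infinity is replaced by a finite T and the integral by a
    midpoint rule, so the result carries an error of order 1/T.
    """
    h = 2.0 * T / n
    total = 0.0
    for i in range(n):
        t = -T + (i + 0.5) * h
        first = a1 * math.sin(t) + b1 * math.cos(t)
        second = a2 * math.sin(t + tau) + b2 * math.cos(t + tau)
        total += first * second
    return total * h / (2.0 * T)

# Same member used twice, for two different outcomes (A, B) of the process.
avg_member_1 = time_average_product(1.0, 0.0, 1.0, 0.0, tau=0.0)  # about (1 + 0)/2
avg_member_2 = time_average_product(2.0, 1.0, 2.0, 1.0, tau=0.0)  # about (4 + 1)/2
print(avg_member_1, avg_member_2)
```

The two values differ, whereas the ensemble average $\sigma^2 \cos (t_1 - t_2)$ is a
single number for given $\sigma$; this is the sense in which the process fails to
be ergodic.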

10-6. Correlation Coefficients and Correlation Functions. This section
begins with the introduction of the correlation coefficient $\rho$ which is
commonly used in the study of the interdependence of two random
variables X and Y.

$$\text{Correlation coefficient} = \rho = \frac{E[(X - \overline{X})(Y - \overline{Y})]}{\sqrt{E[(X - \overline{X})^2] \cdot E[(Y - \overline{Y})^2]}} \tag{10-23}$$





The numerator of this expression is called the covariance of the two
variables. If the two variables are independent, their covariance is zero.
If the covariance (and hence the correlation coefficient) is zero, the
variables are said to be uncorrelated; this of course does not imply
statistical independence. In the same trend of thought, it is natural to
extend this useful concept to two sampled random variables of a simple or
a joint process.

Letting [X(t),Y(t)] be a joint stochastic process, the covariance of the
two sampled random variables $X(t_j)$ and $Y(t_k)$ is defined as

$$\text{Covariance} = E\{[X(t_j) - \overline{X(t_j)}][Y(t_k) - \overline{Y(t_k)}]\} \tag{10-24}$$

When the two sampled random variables are selected from a single
process, the covariance coefficient is more specifically called the
autocovariance. The autocovariance indicates a measure of interdependence
or coherence between the two sampled random variables of the process
$X(t_j)$ and $X(t_k)$, that is,

$$\text{Autocovariance } (X(t_j),X(t_k)) = E\{[X(t_j) - \overline{X(t_j)}][X(t_k) - \overline{X(t_k)}]\} \tag{10-25}$$

An obvious simplification occurs if a convenient choice is made:

$$\begin{array}{ll}
\overline{X(t_j)} = 0 & \overline{X(t_k)} = 0 \\
\overline{Y(t_j)} = 0 & \overline{Y(t_k)} = 0
\end{array} \tag{10-26}$$

This “simplification” can be considered as an effect of a linear change of
the variables. Such an operation has no significant effect on our studies
except the simplification of results. In such a case,

$$\begin{aligned}
\text{Covariance of } [X(t_j),Y(t_k)] &= E[X(t_j) \cdot Y(t_k)] \\
\text{Autocovariance of } [X(t_j),X(t_k)] &= E[X(t_j) \cdot X(t_k)]
\end{aligned}$$

For a stationary process of second order with a density function $f(t_j,t_k)$,
the autocovariance depends only on the time lag $t_j - t_k$ and gives a
measure of the effect of the past of the process on its present and future
states:

$$E\{[X(t_j) - \overline{X(t_j)}][X(t_k) - \overline{X(t_k)}]\} = R_{xx}(|t_j - t_k|) \tag{10-28}$$

The autocovariance of a stationary process is sometimes called the
autocorrelation function in engineering literature. This terminology is
not generally in agreement with the mathematical definition of correlation
coefficient:

$$\rho = \frac{E\{[X(t_j) - \overline{X(t_j)}][X(t_j + \tau) - \overline{X(t_j + \tau)}]\}}{\sqrt{E[X(t_j) - \overline{X(t_j)}]^2 \, E[X(t_j + \tau) - \overline{X(t_j + \tau)}]^2}}$$





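However, if the assumption is made of zero first-order averages for $X(t_j)$
and $X(t_k)$, with $t_k = t_j + \tau$, there occurs

$$\rho = \frac{E[X(t_j) \cdot X(t_j + \tau)]}{\sqrt{E[X^2(t_j)] \, E[X^2(t_j + \tau)]}} \tag{10-30}$$

$$\rho = \frac{E[X(t_j) \cdot X(t_j + \tau)]}{R_{xx}(0)} = \frac{R_{xx}(|t_j - t_k|)}{R_{xx}(0)} \tag{10-31}$$

Under the above assumption the autocorrelation function will be identical
with the autocovariance within a constant.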

In order to avoid confusion, the normalized autocovariance function
of a stationary process, that is, $R_{xx}(\tau)/R_{xx}(0)$, will be called the
autocorrelation function of that process. This will be in conformity with the
engineering literature. It will be tacitly assumed that the first-order
averages are zero and a normalization is done to make

$$R_{xx}(0) = 1 \tag{10-32}$$

With this reservation in mind, the autocorrelation function of a stationary
process is

$$\rho_{xx}(\tau) = E[X(t_1)X(t_1 + \tau)] = \int\!\!\int x_1 x_2 f(x_1,x_2)\,dx_1\,dx_2 \tag{10-33}$$

Similarly, for a joint stationary process {X(t),Y(t)} a cross-correlation
function may be defined as

$$\rho_{xy}(\tau) = E[X(t) \cdot Y(t + \tau)] \tag{10-34}$$

To sum up, the autocorrelation and cross-correlation functions are special
covariances for stationary processes of the second order or higher,
provided that the means of the sampled variables are taken to be zero and
$R_{xx}(0) = 1$, $R_{yy}(0) = 1$.

Finally, since later studies are confined to ergodic ensembles in linear
systems, there is one more simplification ahead, the one for ergodic
ensembles with zero first-order averages:

$$\begin{aligned}
\rho_{xx}(\tau) &= \overline{X(t) \cdot X(t + \tau)} = \langle X(t) \cdot X(t + \tau) \rangle \\
\rho_{xy}(\tau) &= \overline{X(t) \cdot Y(t + \tau)} = \langle X(t) \cdot Y(t + \tau) \rangle \\
\rho_{yx}(\tau) &= \overline{Y(t) \cdot X(t + \tau)} = \langle Y(t) \cdot X(t + \tau) \rangle \\
\rho_{yy}(\tau) &= \overline{Y(t) \cdot Y(t + \tau)} = \langle Y(t) \cdot Y(t + \tau) \rangle
\end{aligned} \tag{10-35}$$

In the study of ergodic ensembles, the correlation functions may be 
computed from either probabilistic considerations (joint density of 
samples) or from the deterministic point of view, that is, the time aver- 
aging of the two specific time functions under consideration. The 
equivalence of the two results is assumed by ergodic hypothesis. When 
dealing with two or more sampled random variables of a simple or a 





joint stationary stochastic process ($\tau$ units of time apart) a good picture
is obtained of the interdependence of the different samples by computing
the correlation functions and tabulating the results in a useful matrix
form. For example, for a joint stationary process {X(t), Y(t)}, we have

$$\begin{bmatrix}
\rho_{xx}(\tau) & \rho_{xy}(\tau) \\
\rho_{yx}(\tau) & \rho_{yy}(\tau)
\end{bmatrix} \tag{10-36}$$

When dealing with ergodic ensembles, compute the correlation functions
based on the concept of time averaging; that is, form the product of one
member of the ensemble by another member shifted by a time interval
$\tau$. The product should be averaged as indicated previously. Care
should be exercised, however, to determine that the process is ergodic to
begin with, and this is not generally a simple task.

10-7. Example of a Normal Stochastic Process. Consider the
stochastic process

$$\{X(t)\} = \{A \cos \omega t + B \sin \omega t\} \tag{10-37}$$

A and B being normally distributed random variables with zero means,
unit variances, and zero correlation coefficient; then

$$\begin{aligned}
X(t_i) &= X_i = A \cos \omega t_i + B \sin \omega t_i = \alpha_i A + \beta_i B \\
X(t_k) &= X_k = A \cos \omega t_k + B \sin \omega t_k = \alpha_k A + \beta_k B
\end{aligned} \tag{10-38}$$

The density distribution for the random variable $X_i$ is normal, as the
variable is the linear sum of two normal variables. The same is true for
the distribution of the random variable $X_k$.

$X_i$ normal with zero mean, $\sigma_i = \sqrt{\alpha_i^2 + \beta_i^2} = 1$ standard deviation
$X_k$ normal with zero mean, $\sigma_k = \sqrt{\alpha_k^2 + \beta_k^2} = 1$ standard deviation

The joint density distribution of $X_i$ and $X_k$ is a two-dimensional normal
distribution also with (0,0) mean and a covariance matrix which can be
determined in the following way:

$$\begin{bmatrix} \rho_{11} & \rho_{12} \\ \rho_{21} & \rho_{22} \end{bmatrix}$$

$\rho_{12}$ is the covariance coefficient between $X_i$ and $X_k$:

$$\begin{aligned}
\rho_{12} &= E[(X_i - \overline{X_i})(X_k - \overline{X_k})] = E(X_i X_k) \\
&= E[A^2 \cos \omega t_i \cos \omega t_k + AB \cos \omega t_i \sin \omega t_k \\
&\qquad + BA \sin \omega t_i \cos \omega t_k + B^2 \sin \omega t_i \sin \omega t_k]
\end{aligned} \tag{10-39}$$

Since A and B are uncorrelated random variables,

$$E(AB) = E(BA) = 0$$





Thus

$$\rho_{12} = \cos \omega t_i \cos \omega t_k \, E(A^2) + \sin \omega t_i \sin \omega t_k \, E(B^2) \tag{10-40}$$

$$\rho_{12} = \cos \omega(t_i - t_k)$$

$$\begin{bmatrix} \rho_{11} & \rho_{12} \\ \rho_{21} & \rho_{22} \end{bmatrix} =
\begin{bmatrix} 1 & \cos \omega(t_i - t_k) \\ \cos \omega(t_i - t_k) & 1 \end{bmatrix}$$

Now the joint distribution of $X_i$ and $X_k$ can be found by a direct
substitution of pertinent quantities in the quadratic form associated with a
two-dimensional normal variable. This distribution function is normal
with (0,0) mean and covariances which depend solely on the time
difference $t_i - t_k$; that is, choice of the origin of time is irrelevant. Thus
the process is stationary of the second order. Meanwhile it has been
shown in the above that the normalized correlation function $\rho_{12}$ depends
only on the time difference $t_i - t_k$, as it should be in a second-order
stationary process. The above normal stochastic process can also serve
as an example of a strictly stationary process; that is, the density
distribution $f_n$ for any number n of sampled variables remains intact under
any translation of the time scale. The proof is left for the reader.
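
As a worked illustration of the substitution mentioned above, the standard
form of the two-dimensional normal density with zero means, unit variances,
and correlation $\rho_{12}$ may be written out explicitly. The abbreviation
$\Delta = t_i - t_k$ is introduced here only for brevity and is not part of the
original notation; the expression is valid when $|\rho_{12}| < 1$, that is, when
$\omega\Delta$ is not a multiple of $\pi$.

```latex
f(x_i, x_k) = \frac{1}{2\pi\sqrt{1 - \rho_{12}^2}}
  \exp\left[-\frac{x_i^2 - 2\rho_{12}\,x_i x_k + x_k^2}{2(1 - \rho_{12}^2)}\right],
\qquad \rho_{12} = \cos \omega\Delta,
\qquad 1 - \rho_{12}^2 = \sin^2 \omega\Delta
```

The sampling instants enter only through $\Delta$, which is the second-order
stationarity stated in the text.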








Fig. 10-4. A Poisson-type pulse process, described by Eq. (10-42).


10-8. Examples of Computation of Correlation Functions. Consider 
the following stochastic process, which has many engineering applica- 
tions. A pulse generator transmits pulses with random durations but 
with heights of +E or —E. The occurrence of a pulse is supposed to 
follow the Poisson law, with k/2 the average number of pulses in the unit 
time. 

In order to compute the autocorrelation function of this process, the 
familiar probabilistic definition of the ensemble averages is employed: 

$$\rho_{11}(\tau) = E[X(t_1) \cdot X(t_1 + \tau)] = \sum\sum x_1 x_2 f(x_1,x_2) \tag{10-41}$$

The variables are of a discrete nature, each one assuming binary values of 
plus or minus E. It is necessary to have the joint probability function 
for the two variables. For this, the probabilistic nature of the problem 
should first be clarified. 

The mean number of zero crossings in a time interval T is kT. The
probability of getting a specified number n of crossings in a time interval
T assumes the Poisson distribution with an average of $\lambda = kT$. The
probability of n crossings in time T is $[(kT)^n/n!]e^{-kT}$. Thus the
probability of having +E at time $t + \tau$ (provided that there was +E at time t)
is equal to the probability of having an even number of zero crossings
between the two instants of time.

$$\begin{aligned}
P\{+E \text{ at } t \text{ and } +E \text{ at } t + \tau\}
&= P\{+E \text{ at } t\} \times P\{+E \text{ at } t + \tau \mid +E \text{ at } t\} \\
&= \frac{1}{2} \sum_{n=0,2,4,\ldots} \frac{(k\tau)^n}{n!} e^{-k\tau} \\
P\{+E \text{ at } t \text{ and } -E \text{ at } t + \tau\}
&= \frac{1}{2} \sum_{n=1,3,5,\ldots} \frac{(k\tau)^n}{n!} e^{-k\tau}
\end{aligned} \tag{10-42}$$

Note that the joint probability function depends only on the time lag $\tau$.
The factor 1/2 appears as a consequence of the fact that, on an average,
the probability of having +E or -E is assumed to be equal. Thus


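$$\rho_{11}(\tau) = E[X(t_1) \cdot X(t_1 + \tau)]
= E^2 \sum_{n=0,2,4,\ldots} \frac{(k\tau)^n}{n!} e^{-k\tau}
- E^2 \sum_{n=1,3,5,\ldots} \frac{(k\tau)^n}{n!} e^{-k\tau} \tag{10-43}$$

$$\rho_{11}(\tau) = E^2 e^{-k\tau} \left[ 1 - \frac{k\tau}{1!} + \frac{(k\tau)^2}{2!} - \cdots \right] = E^2 e^{-2k\tau} \tag{10-44}$$

In this equation it is tacitly assumed that $\tau$ is a positive number. The
same procedure of course applies when $\tau$ is negative. To encompass the
two cases, write

$$\rho_{11}(\tau) = E^2 e^{-2k|\tau|}$$

The process is indeed a stationary process of the second order, and the
normalized autocorrelation function $E[X(t_1) \cdot X(t_1 + \tau)]$ remains
invariant under any translation of the time axis. A sketch of the
autocorrelation function is given in Fig. 10-5.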
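
The passage from the even and odd Poisson sums to the closed exponential
form can be checked numerically. In the sketch below the function names and
the numerical values of the pulse height E, the rate k, and the lag are
illustrative assumptions; the infinite sums are truncated after a fixed
number of terms, which is ample for moderate values of k times the lag.

```python
import math

def poisson_pmf(n, mean):
    """Probability of exactly n crossings when the mean number is `mean`."""
    return math.exp(-mean) * mean**n / math.factorial(n)

def autocorrelation_by_series(E, k, tau, terms=80):
    """E^2 * (P{even number of crossings} - P{odd number}) over a lag |tau|."""
    mean = k * abs(tau)
    even = sum(poisson_pmf(n, mean) for n in range(0, terms, 2))
    odd = sum(poisson_pmf(n, mean) for n in range(1, terms, 2))
    return E**2 * (even - odd)

def autocorrelation_closed_form(E, k, tau):
    """Closed form E^2 * exp(-2 k |tau|)."""
    return E**2 * math.exp(-2.0 * k * abs(tau))

for tau in (-0.8, 0.0, 0.3, 2.0):
    print(tau,
          autocorrelation_by_series(2.0, 1.5, tau),
          autocorrelation_closed_form(2.0, 1.5, tau))
```

The two columns agree to within rounding error for each lag, including the
negative one, which is the content of the absolute-value form above.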

As a second example, consider a particular noise process which has 





four distinct equally probable outcomes, as illustrated in Fig. 10-6. It is 
desired to compute the following data: 

1. First-order probability distribution F[X(t)] for t = 2

2. First-order probability distribution F[X(t)] for t = 4

3. Second-order probability distribution $F[X(t_1),X(t_2)]$ for $t_1 = 2$,
$t_2 = 4$

4. $E[X_2]$, $E[X_4]$, $E[X_2 \cdot X_4]$

5. Autocorrelation coefficient $\rho_{xx}[X_2,X_4]$

The probability distributions required in parts (1) and (2) are shown
in Fig. 10-7a and b, respectively.

Fig. 10-7. (a) The CDF of the random variable $X_2$ of Fig. 10-6; (b) the CDF of the
random variable $X_4$ of Fig. 10-6.

The following probability “density” table can easily be obtained:

              X_2 = 1    X_2 = 2    X_2 = 4    X_2 = 6
  X_4 = 3       1/8        1/8        1/8        1/8
  X_4 = 5       1/16       1/16       1/16       1/16
  X_4 = 6       1/16       1/16       1/16       1/16







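The probability distribution function can easily be determined from the
above; for example,

$$P\{X_2 \le 2,\ X_4 \le 3\} = \tfrac{1}{4}$$

The expected values are derived below:

$$E[X_2] = \tfrac{1}{4}(1 + 2 + 4 + 6) = \tfrac{13}{4}$$

$$E[X_4] = \tfrac{1}{2} \cdot 3 + \tfrac{1}{4} \cdot 5 + \tfrac{1}{4} \cdot 6 = \tfrac{17}{4}$$

$$\begin{aligned}
E[X_2 \cdot X_4] &= \tfrac{1}{8} \cdot 3 + \tfrac{1}{8} \cdot 6 + \tfrac{1}{8} \cdot 12 + \tfrac{1}{8} \cdot 18 + \tfrac{1}{16} \cdot 5 \\
&\quad + \tfrac{1}{16} \cdot 10 + \tfrac{1}{16} \cdot 20 + \tfrac{1}{16} \cdot 30 + \tfrac{1}{16} \cdot 6 \\
&\quad + \tfrac{1}{16} \cdot 12 + \tfrac{1}{16} \cdot 24 + \tfrac{1}{16} \cdot 36 = \tfrac{221}{16}
\end{aligned}$$

The desired autocorrelation coefficient can be computed from the relation

$$\rho_{xx}(X_2,X_4) = \frac{E[X_2 \cdot X_4] - E[X_2] \cdot E[X_4]}{\sqrt{E[(X_2 - \overline{X_2})^2] \cdot E[(X_4 - \overline{X_4})^2]}}$$

The computation first requires a knowledge of the second-order
expectations of the type

$$E[(X - \overline{X})^2] = E[X^2] - [E(X)]^2$$

This can be done in a direct manner.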
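
The arithmetic of this example can be carried out exactly with rational
numbers. The sketch below assumes that the joint table reads 1/8 for every
entry in the row $X_4 = 3$ and 1/16 for every entry in the rows $X_4 = 5$ and
$X_4 = 6$; this reading is a reconstruction that is consistent with the
expectations worked in the text, and the variable names are hypothetical.

```python
from fractions import Fraction

x2_values = [1, 2, 4, 6]
x4_values = [3, 5, 6]

# Assumed joint probabilities: 1/8 in the row X_4 = 3, 1/16 elsewhere.
joint = {(x2, x4): Fraction(1, 8) if x4 == 3 else Fraction(1, 16)
         for x2 in x2_values for x4 in x4_values}
total_probability = sum(joint.values())

def expect(g):
    """Expectation of g(X_2, X_4) under the assumed joint table."""
    return sum(p * g(x2, x4) for (x2, x4), p in joint.items())

e_x2 = expect(lambda x2, x4: x2)
e_x4 = expect(lambda x2, x4: x4)
e_x2x4 = expect(lambda x2, x4: x2 * x4)
covariance = e_x2x4 - e_x2 * e_x4
var_x2 = expect(lambda x2, x4: x2 * x2) - e_x2**2
var_x4 = expect(lambda x2, x4: x4 * x4) - e_x4**2

print(total_probability, e_x2, e_x4, e_x2x4, covariance, var_x2, var_x4)
```

Under the assumed table every joint entry is the product of the two marginal
probabilities, so the covariance, and with it the correlation coefficient,
comes out as zero.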

10-9. Some Elementary Properties of Correlation Functions of Sta- 
tionary Processes. The following simple properties can be directly 
established : 

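$$\rho_{11}(0) = 1 \ge |\rho_{11}(\tau)| \tag{10-45}$$

$$\rho_{11}(\tau) = \rho_{11}(-\tau) \tag{10-46}$$

Proof. For a stationary process the normalized autocorrelation function
$\rho_{11}(\tau)$ depends only on $\tau$; hence

$$\begin{aligned}
\rho_{11}(\tau) &= E[X_t - \overline{X_t}][X_{t+\tau} - \overline{X_{t+\tau}}] \\
&= E[X_{t-\tau} - \overline{X_{t-\tau}}][X_t - \overline{X_t}] = \rho_{11}(-\tau)
\end{aligned} \tag{10-47}$$

In order to show Eq. (10-45) (Schwarz's inequality), assume zero first
averages and start from an obvious inequality:

$$\begin{aligned}
E[X(t) \pm X(t + \tau)]^2 &\ge 0 \\
E[X(t)]^2 \pm 2E[X(t) \cdot X(t + \tau)] + E[X(t + \tau)]^2 &\ge 0 \\
\rho_{11}(0) \pm 2\rho_{11}(\tau) + \rho_{11}(0) &\ge 0 \\
\rho_{11}(0) &\ge |\rho_{11}(\tau)|
\end{aligned}$$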

The normalized autocorrelation of the sum of a number of stationary
processes follows the symbolic rule for the ordinary product. For
example, let

$$Z(t) = X(t) + Y(t)$$

$$\begin{aligned}
\rho_{zz}(t_1,t_2) &= E\{[X(t_1) + Y(t_1)][X(t_2) + Y(t_2)]\} \\
&= E[X(t_1) \cdot X(t_2)] + E[X(t_1) \cdot Y(t_2)] + E[Y(t_1) \cdot X(t_2)] \\
&\quad + E[Y(t_1) \cdot Y(t_2)]
\end{aligned} \tag{10-48}$$





The first averages are generally assumed to be zero:

$$E[X(t_1)] = E[X(t_2)] = E[Y(t_1)] = E[Y(t_2)] = 0$$

Then

$$\rho_{zz}(t_1,t_2) = \rho_{xx}(t_1,t_2) + \rho_{xy}(t_1,t_2) + \rho_{yx}(t_1,t_2) + \rho_{yy}(t_1,t_2)$$

Since this chapter is confined to stationary processes, write

$$\rho_{zz}(\tau) = \rho_{xx}(\tau) + \rho_{xy}(\tau) + \rho_{yx}(\tau) + \rho_{yy}(\tau) \tag{10-49}$$

Therefore, under the above assumptions, the rule for deriving the
normalized autocorrelation function of the sum of a number of stationary
processes will symbolically follow the rule for the ordinary product of the
sum of a number of terms; for example,

$$\begin{aligned}
(X + Y)(X + Y) &= XX + XY + YX + YY \\
(X + Y + U)(X + Y + U) &= XX + XY + XU + YX \\
&\quad + YY + YU + UX + UY + UU
\end{aligned} \tag{10-50}$$

When two processes {X} and {Y} are independent, then $\rho_{xy}(\tau) = 0$.

10-10. Power Spectra and Correlation Functions.* The readers are 
undoubtedly familiar with the significance of linear integral trans- 
formations (particularly Fourier and Laplace transforms) in engineering 
problems. The dual relationship of the frequency and the time domain is 
a most significant concept in the study of linear systems. Whenever a 
linear problem becomes involved in one of these domains it is possible 
alternatively to try to solve the problem in the other domain. In this 
respect the reader may recall the familiar development of network theory 
such as the concept of impedance functions and network synthesis pro- 
cedures. These developments have taken a primary place in the light 
of the theory of Laplace transformation. 

It is natural to develop and explore relationships between specific time 
averages of interest and their Fourier integrals. Such a development 
mathematically is very fruitful; meanwhile, from an engineering point of 
view, the idea of Fourier integrals, amplitude, and power spectra has 
certain physical significance. 

For a real second-order stationary process (wide sense), the power 
spectrum is defined as the Fourier transform of the normalized auto- 
correlation function of the process: 

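$$\Phi_{xx}(\omega) = \int_{-\infty}^{\infty} \rho_{xx}(\tau) e^{-j\omega\tau}\,d\tau \tag{10-51}$$

$$\rho_{xx}(\tau) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \Phi_{xx}(\omega) e^{j\omega\tau}\,d\omega \tag{10-52}$$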

provided that the integrals exist. Since the autocorrelation function is 

* The author wishes to make acknowledgement to Drs. R. A. Johnson and S. 
Jutila for valuable insights gained in discussions with them on the stochastic 
behavior of linear systems. 



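an even function, it may alternatively be written

$$\Phi_{xx}(\omega) = 2 \int_{0}^{\infty} \rho_{xx}(\tau) \cos \omega\tau\,d\tau \tag{10-53}$$

$$\rho_{xx}(\tau) = \frac{1}{\pi} \int_{0}^{\infty} \Phi_{xx}(\omega) \cos \omega\tau\,d\omega \tag{10-54}$$

These equations are sometimes called Wiener-Khinchin relations.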

As an application of the Wiener-Khinchin relation, we compute the 
power spectrum of the Poisson process of Sec. 10-8. 


$$\Phi_{xx}(\omega) = 2 \int_{0}^{\infty} E^2 e^{-2k\tau} \cos \omega\tau\,d\tau = \frac{4kE^2}{\omega^2 + 4k^2}$$

When dealing with two such processes, the cross power spectrum,
which has a similar relationship to the cross-correlation function, may
also be defined.*

Next, consider an ergodic process {X(t)} and define an average power
associated with any member of this ensemble. Let X(t) be a specific
member of the process and assume X(t) to be a real-valued function of
time. The average power associated with the $X_T(t)$ truncated during the
interval (-T, +T) is

$$\text{Average power in time } 2T = \frac{1}{2T} \int_{-T}^{T} [X_T(t)]^2\,dt \tag{10-55}$$

If $X_T(t)$ were a current flowing into a 1-ohm resistor, the above expression
would indicate the average power dissipated in that resistor. The
average power associated with an ergodic process can be defined as

$$\langle [X(t)]^2 \rangle = \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{T} [X_T(t)]^2\,dt \tag{10-56}$$

This average will be the same for almost all members of the ergodic ensemble. 

Now let $F_T(j\omega)$ be the Fourier transform of the truncated time
function $X_T(t)$ for a specific member of the ensemble:

$$X_T(t) = 0 \quad t > +T$$
$$X_T(t) = 0 \quad t < -T$$

$$F_T(j\omega) = \int_{-T}^{T} X_T(t) e^{-j\omega t}\,dt \tag{10-57}$$

The immediate purpose is to derive a relationship between the average
power and the function $F(j\omega)$, where

$$F(j\omega) = \lim_{T \to \infty} F_T(j\omega) \tag{10-58}$$


* W. R. Bennett, Methods of Solving Noise Problems, Proc. IRE, May, 1956,
pp. 609-638; J. H. Laning, Jr., and R. H. Battin, “Random Processes in Automatic
Control,” McGraw-Hill Book Company, Inc., New York, 1956.





assuming such a unique limit exists for all members of the process. To
this end, the power spectrum of a stationary process can be alternatively
defined as

$$\Phi_{xx}(\omega) = \lim_{T \to \infty} \frac{1}{2T} E[|F_T(j\omega)|^2]$$

Subsequent to some algebraic manipulations, one can show that, for an
ergodic process, this definition is consistent with the one given earlier.

By Plancherel's relation, we have

$$\lim_{T \to \infty} \frac{1}{2T} \cdot \frac{1}{2\pi} \int_{-\infty}^{\infty} |F_T(j\omega)|^2\,d\omega
= \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{T} [X_T(t)]^2\,dt = \langle [X(t)]^2 \rangle \tag{10-59}$$

Using the defining equation of power spectrum and the relation

$$\rho_{xx}(0) = \langle [X(t)]^2 \rangle = \frac{1}{2\pi} \int_{-\infty}^{\infty} \Phi_{xx}(\omega)\,d\omega < \infty \tag{10-60}$$

we find that*

$$\Phi_{xx}(\omega) = |F(j\omega)|^2 \tag{10-61}$$

For an ergodic process the power spectrum is the limit of the square of
the magnitude of the Fourier transform of the truncated time function
$X_T(t)$ when T is increased indefinitely. This is a unique deterministic
function for all outcomes of the process.

Finally it is to be noted that the power spectrum of the sum of several
independent processes is the sum of their individual spectra. In this
sense “linearity” holds, as in the case of the autocorrelation function of
the sum of a number of independent processes.

Thus it has been shown that there is a close relationship between power 
spectrum and correlation functions. When dealing with problems of an 
ergodic nature, it is convenient to use either correlation function or power 
spectrum, depending on circumstances. 

10-11. Response of Linear Lumped Systems to Ergodic Excitation.†
Electrical engineers are quite familiar with the study of ordinary linear
lumped bilateral networks under a periodic deterministic regime. In such
systems, when initially relaxed, if a unit impulse excitation u_0(t) applied at

* The assumptions leading to Eq. (10-61) are involved with mathematical
complexities not given here (see Middleton, Chap. 3; and Davenport and Root, Chap. 6).

† The statistical design of linear systems has been developed during the past two
decades. Among those who have made significant contributions to this development
are Wiener, Kolmogorov, Shannon, Rice, Bode, Middleton, Zadeh, Lee, and many
others. Today the subject of filtering and prediction occupies the nucleus of a course
in the graduate curriculum of many electrical engineering departments around the
world. An adequate coverage of this topic, which should include the work of a great
many scientists, is completely outside the scope of the present book. Those interested
in a full treatment of the subject are referred to Laning and Battin, Wiener,
Blanc-Lapierre, Davenport and Root, and Middleton.





time t = 0 produces an output of h(t), then the response of the same system
to an excitation x(t) is given by the familiar superposition integral:

    Output = y(t) = \int_{-\infty}^{\infty} x(\tau) \cdot h(t - \tau) \, d\tau
    with h(t) = 0 for t < 0    (10-62)

This is due to the fact that a unit impulse u_0(t - τ) applied to the
system at time t = τ will give rise to an output of h(t - τ).
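The superposition integral can be checked numerically. The following is a minimal sketch only: the impulse response h(t) = e^{-t}, the unit-step excitation, and the helper name `superposition_response` are assumptions chosen for illustration and do not come from the text. For this choice the exact response is 1 - e^{-t}.

```python
import numpy as np

def superposition_response(x, h, dt):
    """Approximate y(t) = integral of x(tau) h(t - tau) d tau on a uniform grid.

    Both x and h are sampled from t = 0 onward, so h(t) = 0 for t < 0 is implied.
    """
    # np.convolve forms sum_k x[k] h[n - k]; the factor dt makes it a Riemann sum.
    return np.convolve(x, h)[: len(x)] * dt

dt = 1e-3                      # assumed sampling step
t = np.arange(0.0, 5.0, dt)
h = np.exp(-t)                 # assumed impulse response of a first-order system
x = np.ones_like(t)            # assumed unit-step excitation applied at t = 0
y = superposition_response(x, h, dt)

# Largest deviation from the exact response 1 - exp(-t); it shrinks with dt.
print(np.max(np.abs(y - (1.0 - np.exp(-t)))))
```

The discretization error is of the order of the step dt, so a finer grid brings the sum closer to the integral.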



The use of Fourier or Laplace transforms is most natural. In fact,
the convolution of two functions in the time domain corresponds to the
product of their Laplace transforms:

    \mathcal{L}[h(t)] = T(s)
    \mathcal{L}[x(t)] = A(s)    (10-63)
    \mathcal{L}[y(t)] = B(s)

Then for an initially relaxed system,

    B(s) = A(s) \cdot T(s)    (10-64)


From this latter relation the inverse Laplace transform may be computed; 
thus, the problem is at least theoretically solved. The above funda- 
mental relation is perhaps the most significant equation of the linear 
system theory. 

The principal aim in this section is to explore the possibility of deriving 
a fundamental relation for the performance of a linear lumped bilateral 
system under ergodic regimes. Fortunately, an equally simple and ele- 
gant relationship exists. This simplicity is the main reason for the 
existence of the vast literature on the subject. 

Let {X(t)} be an ergodic input to a linear system initially at rest with a
unit impulse response of h(t). For a moment, concentrate on one specific
member of the input ensemble, X(t). The corresponding output will be a
specific time function Y(t) such that

    Y(t) = \int_{-\infty}^{t} X(\tau) \, h(t - \tau) \, d\tau = \int_{0}^{\infty} X(t - \alpha) \, h(\alpha) \, d\alpha    (10-65)*

* By letting t - \tau = \alpha, the integral \int_{0}^{\infty} X(t - \alpha) \, h(\alpha) \, d\alpha is obtained.






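If the specific input is shifted in time by τ, the output will undergo an
equal time translation. This will enable a computation of the autocorrelation
function of the output ensembles. The autocorrelation function can be written
in the form of a double integral instead of the product of two integrals:*

    \rho_{yy}(\tau) = \langle Y(t) \cdot Y(t + \tau) \rangle
                    = \int_{0}^{\infty} \int_{0}^{\infty} \langle X(t - \alpha) \, X(t + \tau - \beta) \rangle \, h(\alpha) \cdot h(\beta) \, d\alpha \, d\beta    (10-66)

Note that

    \langle X(t - \alpha) \cdot X(t + \tau - \beta) \rangle = \langle X(t - \alpha) \cdot X(t - \alpha + \tau - \beta + \alpha) \rangle
                                                             = \rho_{xx}(\tau - \beta + \alpha)    (10-67)

But

    \rho_{xx}(\tau - \beta + \alpha) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi_{xx}(\omega) \, e^{j\omega(\tau - \beta + \alpha)} \, d\omega

Consequently,

    \rho_{yy}(\tau) = \frac{1}{2\pi} \int_{0}^{\infty} \int_{0}^{\infty} \int_{-\infty}^{\infty} \phi_{xx}(\omega) \, e^{j\omega(\tau - \beta + \alpha)} \, h(\alpha) \cdot h(\beta) \, d\alpha \, d\beta \, d\omega

    \rho_{yy}(\tau) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi_{xx}(\omega) \, e^{j\omega\tau} \, d\omega \int_{0}^{\infty} h(\alpha) \, e^{j\omega\alpha} \, d\alpha \int_{0}^{\infty} h(\beta) \, e^{-j\omega\beta} \, d\beta    (10-68)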


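According to the fundamental equation of linear systems, the system
function is the Laplace transform (Fourier transform if s = jω) of the
unit impulse response h(t):

    \int_{0}^{\infty} h(\alpha) \, e^{j\omega\alpha} \, d\alpha = T(-j\omega)
    \int_{0}^{\infty} h(\beta) \, e^{-j\omega\beta} \, d\beta = T(j\omega)    (10-69)

Therefore

    \rho_{yy}(\tau) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi_{xx}(\omega) \, e^{j\omega\tau} \, [T(j\omega) \cdot T(-j\omega)] \, d\omega    (10-70)

Using the definition of the power spectrum for autocorrelation functions,

    \int_{-\infty}^{\infty} \phi_{yy}(\omega) \, e^{j\omega\tau} \, d\omega = \int_{-\infty}^{\infty} \phi_{xx}(\omega) \, e^{j\omega\tau} \, [T(j\omega) \cdot T(-j\omega)] \, d\omega

    \phi_{yy}(\omega) = \phi_{xx}(\omega) \cdot T(j\omega) \cdot T(-j\omega)

    \phi_{yy}(\omega) = \phi_{xx}(\omega) \, |T(j\omega)|^2    (10-71)†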


* The interchanging of the integral sign and the averaging sign can be justified.
For additional proof see Davenport and Root, Secs. 9-2 and 9-Jl.

† The use of the Fourier transform in this section is in conformity with the
mathematical literature on the subject, particularly the defining of equations of power
spectra. This should not deter the reader from making a comparison with the
fundamental deterministic equation of the linear system as given previously in the
notation of the Laplace transform.





This is the fundamental relation for the performance of linear systems 
under ergodic regimes. The power spectrum of the output can be directly 
computed from the knowledge of the system function and the power 
spectrum of the input. Statistical information about the input thus will 
lead to a statistical identification of the output. 

To sum up, the reader should feel that he has obtained some concrete
results; namely, when faced with an ergodic stochastic input to a linear (or
linearized) system, he may proceed with the following steps:

1. Determine the autocorrelation function of the input.

2. Compute the power spectrum of the input.

3. Obtain the system function T(s) between the input and the output
ports of the system.

4. Compute the square of the magnitude of the system function for
different values of ω, |T(jω)|².

5. The power spectrum of the output is |T(jω)|² · φ_xx(ω).

6. The autocorrelation function of the output process can be determined by
taking the inverse Fourier transform of the expression obtained in step 5.
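The six steps can be traced numerically. The sketch below is illustrative only: it assumes an input autocorrelation A e^{-α|τ|} (so that the input power spectrum is 2Aα/(α² + ω²) under the 1/2π convention used in this section) and a hypothetical first-order system function T(s) = b/(s + b). The constants, the frequency grid, and the helper names `trapezoid` and `rho_yy` are assumptions made for the example, not part of the text.

```python
import numpy as np

def trapezoid(f, w):
    """Trapezoidal rule for samples f on the grid w."""
    return float(np.sum((f[1:] + f[:-1]) * np.diff(w)) / 2.0)

A, alpha, b = 2.0, 1.0, 3.0                 # assumed constants
w = np.linspace(-2000.0, 2000.0, 400001)    # assumed frequency grid (rad/s)

# Steps 1-2: input autocorrelation A exp(-alpha |tau|) and its power spectrum.
phi_xx = 2.0 * A * alpha / (alpha**2 + w**2)

# Step 3: hypothetical system function T(s) = b / (s + b), evaluated at s = j w.
T = b / (1j * w + b)

# Step 4: squared magnitude of the system function.
T_sq = np.abs(T) ** 2

# Step 5: power spectrum of the output.
phi_yy = T_sq * phi_xx

# Step 6: inverse Fourier transform; phi_yy is even, so only the cosine part remains.
def rho_yy(tau):
    return trapezoid(phi_yy * np.cos(w * tau), w) / (2.0 * np.pi)

variance = rho_yy(0.0)
print(variance, A * b / (alpha + b))        # numerical value and closed form
```

For this particular pair the integral can also be done in closed form, ρ_yy(0) = Ab/(α + b), which gives a convenient check on the numerical result.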

Example of White Noise. Consider the white noise applied to the input
terminals of the network of Fig. 10-10. It is desired to give an indication
of the statistical nature of the output.

The term white noise refers to a stochastic process with a constant
power spectrum over all frequency ranges.

    \phi_{xx}(\omega) = a

(This concept implies infinite power, which is not realistic. At present,
ignore this unrealistic implication.) The required system function is
found to be



Fig. 10-10. Example of a linear system under stationary regime.


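    \frac{E_2}{E_1} = \frac{1/C_1 s}{L_1 s + 1/C_1 s} \cdot \frac{1/C_2 s}{L_2 s + 1/C_2 s} = \frac{1}{(C_1 L_1 s^2 + 1)(C_2 L_2 s^2 + 1)}

The power spectrum of the output process is

    \phi_{yy}(\omega) = \frac{a}{(1 - C_1 L_1 \omega^2)^2 (1 - C_2 L_2 \omega^2)^2}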


An example of an application of Eq. (10-71) to communication theory is
the effect of thermal resistor noise. In light of physical considerations,
one concludes that the voltage fluctuations due to thermal noise may be
considered as a gaussian process. These considerations lead to the fact
that a linear resistor of resistance R can be replaced by a passive noiseless
resistor in series with a stochastic voltage source having a flat power
spectrum of 2KTR. (T is the temperature of the system in degrees
Kelvin, and K is referred to as Boltzmann's constant. In most applications
the value of KT is taken as 4 X

The problem of the study of noise in linear systems is treated in detail
by many authors, for example, Lawson and Uhlenbeck, Rice, Davenport
and Root, and Freeman. The general approach to these problems consists
in replacing noisy resistors by noiseless ones fed by gaussian sources.
Consequently, at any output port of the linear system, one can calculate
the output power spectrum by proper application of (10-71).

Example 10-4. A stationary input {X(t)} with an autocorrelation function

    \rho_{xx}(\tau) = A e^{-\alpha|\tau|}

is applied to a system such as that described by the differential equation

g'fa5+% = |X(0) 


Determine the power spectrum and the variance of the output process.
Solution. The power spectrum of the input is

    \phi_{xx}(\omega) = \frac{2A\alpha}{\alpha^2 + \omega^2}

The power spectrum of the output can be directly obtained by the application
of Eq. (10-71). The variance of the output may be obtained in the following manner:

    \rho_{yy}(\tau) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi_{yy}(\omega) \, e^{j\omega\tau} \, d\omega

    Variance of the output = \rho_{yy}(0) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi_{xx}(\omega) \, |T(j\omega)|^2 \, d\omega

The integration may be accomplished by consulting appropriate integral tables.

10-12. Stochastic Limits and Convergence. For the sake of mathematical
completeness, we now supplement the introductory study of stochastic
processes with a discussion of the concept of integration and
differentiation of such processes. This can be accomplished after some
limiting procedure for the process has been defined.

From a physical point of view, in engineering problems, a noise process
is generally subject to integration and differentiation when it passes
through a linear system. The ordinary integrators and differentiators
which are so commonly used in servomechanisms give the simplest
examples of such situations.

This section defines the stochastic limits and the stochastic convergence.
The following modes of stochastic convergence are frequently used:

Convergence in Probability (Abbreviated Form i.p.). An infinite
sequence of random variables,

    X_1, X_2, X_3, \ldots, X_n, \ldots

converges to the random variable X in probability if, for any positive
number ε,

    \lim_{n\to\infty} P\{|X_n - X| > \epsilon\} = 0    (10-72)

Convergence in the Mean-square Sense (Abbreviated Form m.s.). This
convergence is defined as follows:

    \mathrm{l.i.m.} \; X_n = X
    if \lim_{n\to\infty} E|X_n - X|^2 = 0

where l.i.m. stands for the limit in the mean.

Almost Certain Convergence (Abbreviated a.c.). For a.c. convergence,
the set of realized sequences of X_1, X_2, ..., X_n, ... converges to X
with probability 1 when n approaches infinity. The a.c. convergence,
sometimes called the strong convergence, implies i.p. convergence (but not
conversely). There are also other modes of convergence defined in the
literature (for necessary and sufficient conditions for each mode of
convergence, see, for example, M. S. Bartlett, “An Introduction to Stochastic
Processes,” or J. E. Moyal, J. Roy. Statist. Soc., ser. B, vol. 11, no. 2,
1949).

The above concept of convergence applied to a discrete sequence can
be extended to the case of stochastic functions. For example, for a
process {X(t)}, define X_0(t) as the stochastic limit for X(t), abbreviated as

    \mathrm{l.i.m.} \; X(t) = X_0(t)

when it satisfies the condition

    \lim E[X(t) - X_0(t)]^2 = 0    (10-74)

This is the definition for stochastic m.s. convergence of the process.
(There are also a.c. convergence and convergence in probability for
stochastic functions, but a discussion will not be undertaken here.)

Example 10-5. Consider the tossing of an honest coin. Let X(n) be the number of
heads in n throwings divided by n; that is, X(n) is a random variable obeying the
binomial distribution:

    E[X(n)] = \frac{1}{2}    average
    \sigma[X(n)] = \frac{1}{2\sqrt{n}}    standard deviation

It can be shown that the sequence

    X(1), X(2), \ldots, X(n), \ldots

converges, in the sense of 1, 2, and 3 below, to 1/2.

1. By Chebyshev's inequality (Feller, p. 219, or Loeve, p. 14):

    P\{|X(n) - \tfrac{1}{2}| > \epsilon\} \le \frac{1}{4n\epsilon^2}

In the limit:

    \lim_{n\to\infty} P\{|X(n) - \tfrac{1}{2}| > \epsilon\} = 0

2. For n → ∞:

    E[X(n) - \tfrac{1}{2}]^2 = \frac{1}{4n}
    \mathrm{l.i.m.}_{n\to\infty} \; X(n) = \tfrac{1}{2}
    \lim_{n\to\infty} E[X(n) - \tfrac{1}{2}]^2 = 0

3. Using the strong law, one can also prove that X(n) converges to 1/2 in the a.c.
sense. The proof is somewhat complex and will be omitted here (see Loeve, p. 19).
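The first two modes of convergence in the coin-tossing example can be verified by direct computation from the binomial distribution. The following is a small sketch under that assumption; the function names and the choice ε = 0.1 are illustrative and not part of the text. It evaluates the exact mean-square error, which should equal 1/(4n), and compares the exact tail probability with the Chebyshev bound 1/(4nε²).

```python
from math import comb

def mean_square_error(n):
    """Exact E[(X(n) - 1/2)^2] for X(n) = (number of heads in n tosses) / n."""
    return sum(comb(n, k) * (k / n - 0.5) ** 2 for k in range(n + 1)) / 2 ** n

def tail_probability(n, eps):
    """Exact P{|X(n) - 1/2| > eps} for a fair coin."""
    return sum(comb(n, k) for k in range(n + 1) if abs(k / n - 0.5) > eps) / 2 ** n

def chebyshev_bound(n, eps):
    """Chebyshev bound sigma^2 / eps^2 with sigma^2 = 1 / (4n)."""
    return 1.0 / (4.0 * n * eps ** 2)

for n in (10, 100, 1000):
    print(n, mean_square_error(n), tail_probability(n, 0.1), chebyshev_bound(n, 0.1))
```

Both the mean-square error and the tail probability decrease as n grows, and the Chebyshev bound, although loose, always lies above the exact tail probability.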

10-13. Stochastic Differentiation and Integration. The rigorous treatment
of modes of convergence, continuity, differentiability, and integrability
for stochastic processes is beyond the scope of this text. For the
immediate purpose, it seems sufficient to give rudimentary definitions
similar to the concepts acquired in ordinary courses in analysis.

A real process X(t) is said to be continuous at a time t in m.s. if

    \lim_{h\to 0} E|X(t + h) - X(t)|^2 = 0    (10-75)

The m.s. derivative of a continuous process, shown by X'(t), is defined as

    \mathrm{l.i.m.}_{h\to 0} \; \frac{X(t + h) - X(t)}{h} = X'(t)    (10-76)


As in the case of the integration of deterministic functions, different
types of stochastic integrals can be defined. The Riemann stochastic
integral as well as the Lebesgue and the Stieltjes stochastic integrals is
frequently used in the technical literature. At present, this discussion is
confined to giving the definition of stochastic integrals in the Riemann
sense.

Divide the real interval of integration (a,b) into n arbitrary subintervals
Δt_i. The Riemann integral of a stochastic process {X(t)} can
be approached in the following way:

    \int_{a}^{b} X(t) \, dt = \mathrm{l.i.m.}_{n\to\infty} \; \sum_{i=1}^{n} X(t_i) \, \Delta t_i = R(a,b)    (10-77)





This is contingent upon the convergence of the sum into the function
R(a,b) in the m.s. sense. It can be shown that the integral exists if the
double integral of the autocovariance function of the process, ρ_xx(t_1,t_2),
exists over the square [(a,b) by (a,b)] in the (t_1,t_2) plane (see J. E. Moyal,
J. Roy. Statist. Soc., ser. B, vol. 11, no. 2, 1949, pp. 168-169).

It is not intended to delve further into this problem; however, it is of
interest to note that, once the concept of differentiation and integration is
introduced, the territory of the subject of discussion can be extended to
cover the stochastic differential, integral, and difference equations. For
example, the differential equation relating current and voltage in an
RLC series network is

    L \frac{d^2 i}{dt^2} + R \frac{di}{dt} + \frac{1}{C} i = \frac{de}{dt}

Fig. 10-11. Example of an RLC network under stochastic regime.

If the driving voltage is a stochastic process {E(t)}, the current response
will also be a stochastic process {I(t)}, the two processes being related by
the stochastic differential equation

    L\{I''(t)\} + R\{I'(t)\} + \frac{1}{C}\{I(t)\} = \{E'(t)\}    (10-79)

The solutions to such equations are somewhat similar to the solution of
their counterparts in the ordinary theory of differential equations.

The different moments and correlation functions of a stochastic process
and its derivative are interrelated. Let {X(t)} be a process and {X'(t)}
its first derivative. Then, with some mathematical care* it is possible to
show that

    X'(t) - \overline{X'(t)} = \frac{d}{dt} [X(t) - \overline{X(t)}]

    E\{[X'(t_1) - \overline{X'(t_1)}][X'(t_2) - \overline{X'(t_2)}]\} = \frac{\partial^2}{\partial t_1 \, \partial t_2} E\{[X(t_1) - \overline{X(t_1)}][X(t_2) - \overline{X(t_2)}]\}

    R_{x'x'}(t_1,t_2) = \frac{\partial^2 R_{xx}(t_1,t_2)}{\partial t_1 \, \partial t_2}

Thus, the correlation function of the derivative is equal to the second
mixed derivative of the correlation of the original process. Furthermore,
if {X(t)} is stationary it follows that its derivative is also a stationary
process.

    R_{x'x'}(\tau) = -\frac{d^2 R_{xx}(\tau)}{d\tau^2}    \tau = t_2 - t_1
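For a stationary process the relation between the correlation function of the process and that of its derivative reduces to R_{x'x'}(τ) = -d²R_xx(τ)/dτ². A short worked illustration follows. The process chosen, X(t) = A cos ωt + B sin ωt with zero-mean uncorrelated amplitudes of equal variance σ², is an assumption made for the example and is not taken from the derivation above.

```latex
% Assumed illustration: X(t) = A\cos\omega t + B\sin\omega t with
% E(A) = E(B) = 0, E(AB) = 0, E(A^2) = E(B^2) = \sigma^2.
\begin{align*}
R_{xx}(\tau) &= E[X(t)\,X(t+\tau)] = \sigma^2 \cos\omega\tau \\
X'(t) &= -A\omega\sin\omega t + B\omega\cos\omega t \\
R_{x'x'}(\tau) &= E[X'(t)\,X'(t+\tau)] \\
  &= \sigma^2\omega^2\,[\sin\omega t\,\sin\omega(t+\tau) + \cos\omega t\,\cos\omega(t+\tau)]
   = \sigma^2\omega^2\cos\omega\tau \\
-\frac{d^2 R_{xx}(\tau)}{d\tau^2} &= \sigma^2\omega^2\cos\omega\tau = R_{x'x'}(\tau)
\end{align*}
```

Both routes give σ²ω² cos ωτ, in agreement with the stationary form of the relation.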


* The process {X(t)} is assumed to have a derivative and differentiable averages
and correlation function. Although an informal derivation of the above relations
is straightforward, a rigorous derivation is somewhat involved with convergence
considerations.





Similar observations can be made about an integral of a random process.
Let {X(t)} be a known process and f(s,t) a suitable kernel. We are
interested in studying the integral process {Y(s)}.

    \{Y(s)\} = \int f(s,t) \, \{X(t)\} \, dt

Taking into consideration some mathematical concepts (existence of
finite averages, correlation, basic definition of the process, etc.)
which are not covered here, one finds

    Y(s) - \overline{Y(s)} = \int f(s,t) \, [X(t) - \overline{X(t)}] \, dt

    E\{[Y(s_1) - \overline{Y(s_1)}][Y(s_2) - \overline{Y(s_2)}]\}
        = \iint f(s_1,t_1) \, f(s_2,t_2) \, E\{[X(t_1) - \overline{X(t_1)}][X(t_2) - \overline{X(t_2)}]\} \, dt_1 \, dt_2
        = \iint f(s_1,t_1) \, f(s_2,t_2) \, R_{xx}(t_1,t_2) \, dt_1 \, dt_2

10-14. Gaussian-process Examples of a Stationary Process. In the
study of the noise in communication systems, the engineer is frequently
faced with periodic processes which have converging Fourier series
expansions in a given interval, say (-T,T):

    X(t) = \sum_{k=1}^{\infty} (a_k \cos k\omega t + b_k \sin k\omega t)    (10-80)*

a_k and b_j are generally mutually independent random variables for all
positive integers k and j. In many instances it is convenient to assume
that random variables a_k and b_j have normal distributions, with zero
means and specified variances:

    (I)    E(a_k) = 0    k = 1, 2, ...
           E(b_k) = 0
    (II)   E(a_k b_j) = E(a_k) \cdot E(b_j) = 0
    (III)  E(a_k^2) = E(b_k^2) = \sigma_k^2    \sum_{k=1}^{\infty} \sigma_k^2 < \infty

The first objective is to show that such a process is stationary. For this,
compute the second-order joint density function at times, say, t_1 and t_2.
According to the central-limit theorem, the random variables X(t_1) and
X(t_2) will approach normal distributions having zero means and equal
standard deviations. Furthermore, a bivariate central-limit theorem
will show that these two normal random variables also have a joint
normal distribution. Under assumption III, the standard deviation
of each term (a_k,b_k) of the sum X(t) will be independent of t. Thus the
joint density of X(t_1) and X(t_2) is independent of t_1 and t_2. It can also

* D-c component assumed to be zero. 





be similarly concluded that the joint density for any number of sampling
points depends only on the pertinent time differences and that the
process is strictly stationary.

Now compute the autocorrelation function of the process:

    E[X(t_1) X(t_2)] = \sum_{m=1}^{\infty} \sum_{n=1}^{\infty} \{E[a_m a_n \cos m\omega t_1 \cos n\omega t_2]
                         + E[b_m b_n \sin m\omega t_1 \sin n\omega t_2]\}
                     = \sum_{n=1}^{\infty} \{E[a_n^2 \cos n\omega t_1 \cos n\omega t_2]
                         + E[b_n^2 \sin n\omega t_1 \sin n\omega t_2]\}    (10-82)

    E[X(t_1) \cdot X(t_2)] = \sum_{n=1}^{\infty} E[a_n^2 \cos n\omega(t_1 - t_2)]

By letting t_2 = t_1 + τ,

    E[X(t_1) \cdot X(t_1 + \tau)] = \rho_{xx}(t_1, t_1 + \tau) = \rho_{xx}(\tau) = \sum_{n=1}^{\infty} \cos n\omega\tau \; E(a_n^2)    (10-83)

This function, of course, depends only on τ. The process is stationary
of second order.

This process is an example of a larger class of processes which are
generally called stationary gaussian in the literature.
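The dependence of the autocorrelation on the time difference alone can be illustrated by a Monte Carlo experiment. The sketch below is only an illustration: it truncates the Fourier series to three harmonics, and the fundamental frequency, the standard deviations, the number of trials, and the function names are all assumed values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
omega = 2.0 * np.pi                    # assumed fundamental angular frequency
sigma = np.array([1.0, 0.5, 0.25])     # assumed standard deviations of a_n and b_n
n = np.arange(1, len(sigma) + 1)       # harmonic numbers 1, 2, 3
trials = 200_000                       # number of ensemble members drawn

# Independent zero-mean normal coefficients, one row per ensemble member.
a = rng.normal(0.0, sigma, size=(trials, len(sigma)))
b = rng.normal(0.0, sigma, size=(trials, len(sigma)))

def sample_X(t):
    """Value of every ensemble member at time t."""
    return (a * np.cos(n * omega * t) + b * np.sin(n * omega * t)).sum(axis=1)

def estimated_autocorrelation(t1, tau):
    """Ensemble average of X(t1) X(t1 + tau)."""
    return float(np.mean(sample_X(t1) * sample_X(t1 + tau)))

def theoretical_autocorrelation(tau):
    """Sum over n of E(a_n^2) cos(n omega tau)."""
    return float(np.sum(sigma**2 * np.cos(n * omega * tau)))

for t1 in (0.0, 0.37):
    print(t1, estimated_autocorrelation(t1, 0.1), theoretical_autocorrelation(0.1))
```

The estimates obtained at the two starting times should agree with each other and with the theoretical sum to within the sampling error of the experiment.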

10-15. The Over-all Mathematical Structure of Stochastic Processes.
This section will give an over-all picture and a summary of the discussed
stochastic processes. The major source of information for the content of
this section is the talk by J. L. Doob which was given before the
International Congress of Mathematicians at the Amsterdam meeting in
1954. According to this source, a satisfactory definition of a stochastic
process is that it consists of a family of random variables. (In most
cases, the variables are real-valued functions; this is referred to as the
standard process.)

Standard Stochastic Processes. As discussed previously, these processes
are defined for the parameter set T:

    \{X(t), \; t \in T\}

where T is the real space. The joint distributions of the finite sets of
the random variables of the process at different times are given by
definition of the process:

    X(t_1), X(t_2), \ldots, X(t_n)

The standard process is stationary if, for every finite parameter set
t_1, t_2, ..., t_n, the joint distribution of

    X(t_1 + h), X(t_2 + h), \ldots, X(t_n + h)

does not depend on the number h.




Standard Process with Mutually Independent Random Variables.
Consider a process {X(t), t ∈ T}. If t_1 < t_2 < ... < t_n is a set of
parameter points on T and if the variables

    X(t_2) - X(t_1); X(t_3) - X(t_2); \ldots; X(t_n) - X(t_{n-1})

are mutually independent, then the process is said to be a standard process
with mutually independent random variables. The most important subclass
of these processes is the Brownian-motion process. In a Brownian
motion, the variable X(t_k) - X(t_{k-1}) is assumed to have normal distribution
with zero mean and variance that is proportional to the parameter
t_k - t_{k-1} for all values of the integer k.
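The two defining properties of the Brownian-motion process, independent increments and increment variance proportional to the elapsed parameter, can be illustrated by simulation. The following is a sketch under assumed values: the proportionality constant c, the step dt, the number of paths, and the function names are chosen for the example only.

```python
import numpy as np

rng = np.random.default_rng(1)
c = 1.0          # assumed proportionality constant (variance per unit of t)
dt = 0.01        # assumed spacing of the parameter points t_k
steps = 200
paths = 10_000   # number of simulated members of the ensemble

# Independent normal increments with zero mean and variance c * dt.
increments = rng.normal(0.0, np.sqrt(c * dt), size=(paths, steps))
X = np.cumsum(increments, axis=1)   # column k holds X(t_{k+1}), with X(0) = 0

def increment_variance(k1, k2):
    """Sample variance of X(t_k2) - X(t_k1) over the ensemble."""
    return float(np.var(X[:, k2] - X[:, k1]))

def increment_correlation(k1, k2, k3):
    """Sample correlation between two adjacent, non-overlapping increments."""
    u = X[:, k2] - X[:, k1]
    v = X[:, k3] - X[:, k2]
    return float(np.corrcoef(u, v)[0, 1])

print(increment_variance(49, 149), c * 100 * dt)   # expected near 1.0
print(increment_variance(149, 199), c * 50 * dt)   # expected near 0.5
print(increment_correlation(49, 149, 199))         # expected near 0
```

Doubling the elapsed parameter doubles the increment variance, while non-overlapping increments remain essentially uncorrelated.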

Standard Markov Process. The standard process {X(t), t ∈ T} is said
to be Markovian if the conditional probability of a future state depends
only on the present state but not on the past history of the process. In
mathematical language, for t_1 < t_2 < ... < t_n,

    P\{X(t_n) \in A \mid X(t_1), X(t_2), \ldots, X(t_{n-1})\}
        = P\{X(t_n) \in A \mid X(t_{n-1})\}    (10-84)

A Markov process is defined by the conditional probability distribution
for k = 1, 2, ..., n together with P[X(t_1)], the initial
probability of the process.

The so-called Markov chain which frequently appears in the literature
of communication engineering is a particular case of the standard Markov
process having a finite number of discrete states:

    S_1, S_2, \ldots, S_n

The probability of the process going to the kth state depends solely on the
immediately preceding state. This information is generally conveyed
either by a transition probability matrix or by a state diagram.

A typical question on Markov chains is how to determine the probabil- 
ity of reaching state j from state k in m steps. The probability of the 
process being initially in a certain state along with the transition proba- 
bility matrix provides a simple answer for this type of problem. 
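The m-step question has a compact numerical answer: the (k, j) element of the mth power of the transition probability matrix is the probability of reaching state j from state k in exactly m steps. The sketch below assumes a hypothetical three-state matrix; its entries, the initial distribution, and the function names are invented for illustration.

```python
import numpy as np

# Hypothetical transition probability matrix; each row sums to unity.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])

def m_step_matrix(P, m):
    """Matrix whose (k, j) element is the probability of going from k to j in m steps."""
    return np.linalg.matrix_power(P, m)

def state_distribution(p0, P, m):
    """Row vector of state probabilities after m steps, given the initial row p0."""
    return p0 @ m_step_matrix(P, m)

p0 = np.array([1.0, 0.0, 0.0])   # assumed: the process starts in the first state
P2 = m_step_matrix(P, 2)

print(P2[0])                          # two-step probabilities out of the first state
print(state_distribution(p0, P, 2))   # the same row, obtained from the initial distribution
```

With the process started in the first state, the distribution after m steps is simply the first row of the mth power.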

Standard Martingale and Semimartingale. About 1940, J. L. Doob
studied the concept of martingales and semimartingales. A process
{X(t), t ∈ T} is a martingale if the conditional expectation of its future
state, given the past and present, is equal to the value of its present state.
More specifically, for t_1 < t_2 < ... < t_n,

    E\{X(t_n) \mid X(t_1), \ldots, X(t_{n-1})\} = X(t_{n-1})    (10-85)

with probability 1.

For defining a semimartingale, replace the equality by inequality (≥).
Not much application of these processes has yet appeared in the technical
literature except in connection with information theory.





10-16. A Relation between Positive Definite Functions and Theory of
Probability*

Definition of a Positive Definite Function.† A continuous function
f(x) real on the x axis is said to be positive definite if it satisfies the
following requirement. Let x_1, x_2, ..., x_n be n real numbers and
a_1, a_2, ..., a_n complex numbers. Then for all values of n > 2 it is
required to have

    \sum_{h=1}^{n} \sum_{k=1}^{n} f(x_h - x_k) \, a_h \bar{a}_k \ge 0    (10-86)

This definition can be interchanged with the following when a(x) is a
continuous function in a given interval, a < x < b:

    \int_{a}^{b} \int_{a}^{b} f(x - y) \, a(x) \cdot \overline{a(y)} \, dx \, dy \ge 0    (10-87)

Bochner's Theorem. In 1932 S. Bochner introduced the following basic
theorem: Any positive definite function f(x) can be represented by the
Stieltjes integral:

    f(x) = \int_{-\infty}^{\infty} e^{jxy} \, dF(y)    -\infty < x < +\infty    (10-88)

F(y) being a real bounded nondecreasing function. Conversely, any
function represented by such an integral is a positive definite function.
On the basis of this theorem (proof is omitted), it can be seen that the
characteristic function of a random variable is a positive definite
function. In fact, if F(y) is a CDF, then obviously f(x) will be, by definition,
the associated characteristic function. The converse is not true,
since F(y) needs to satisfy the additional conditions F(-∞) = 0 and
F(+∞) = 1 in order to be a permissible CDF.

Khinchin's Theorem. The necessary and sufficient condition for a
function ρ(τ) to be the autocorrelation function of a stationary stochastic
process is that ρ(τ) could be represented as

    \rho(\tau) = \int_{-\infty}^{\infty} \cos \tau x \, dF(x)    -\infty < x < \infty    (10-89)

where F(x) is a CDF.

An equivalent statement is the fact that the necessary and sufficient 

* Positive definite functions are commonly used in physics and engineering. The
material of this section may serve as a reminder of the existence of links between the
theory of positive definite functions and probability. The interested reader is referred
to S. Bochner, “Harmonic Analysis and the Theory of Probability,” University of
California Press, Berkeley, Calif., 1955. The section may be omitted in a first reading.

† Original definition introduced by M. Mathias, Über positive Fourier-Integrale,
Math. Z., Bd. 16, pp. 103-125, 1923. See also K. Fan, Les Fonctions définies-positives
et les fonctions complètement monotones, Mém. sci. math., fascicule 114, Paris, 1950.





condition for a continuous real function ρ(τ) to be a permissible
autocorrelation function for a stationary process is that ρ(τ) should be positive
definite and ρ(0) = 1. The proof of sufficiency will not be presented
here. However, the proof of the necessity, which is comparatively simple,
is given:

The autocorrelation function ρ(τ) must be continuous and satisfy the
three conditions of Sec. 10-9.

    \sum_{h=1}^{n} \sum_{k=1}^{n} \rho(\tau_h - \tau_k) \, a_h \bar{a}_k
        = \sum_{h=1}^{n} \sum_{k=1}^{n} \int_{-\infty}^{\infty} e^{j(\tau_h - \tau_k)x} \, a_h \bar{a}_k \, dF(x)
        = \int_{-\infty}^{\infty} \left( \sum_{h=1}^{n} a_h e^{j\tau_h x} \right) \left( \sum_{k=1}^{n} \bar{a}_k e^{-j\tau_k x} \right) dF(x)
        = \int_{-\infty}^{\infty} \left| \sum_{h=1}^{n} a_h e^{j\tau_h x} \right|^2 dF(x) \ge 0    (10-90)

Thus there is the significant result that, if ρ(τ) is representable in the
above form, ρ(τ) is positive definite.
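Positive definiteness can be probed numerically by forming the matrix of values ρ(τ_h - τ_k) for a chosen set of points and inspecting its smallest eigenvalue: a negative eigenvalue means that some choice of the a_h makes the double sum negative. The sketch below is illustrative only. The two candidate functions, the sample points, and the function names are assumptions; the rectangular candidate is also discontinuous, so it fails the continuity requirement as well.

```python
import numpy as np

def correlation_matrix(rho, taus):
    """Matrix with elements rho(tau_h - tau_k) for the given sample points."""
    taus = np.asarray(taus, dtype=float)
    return rho(taus[:, None] - taus[None, :])

def min_eigenvalue(rho, taus):
    """Smallest eigenvalue of the symmetric matrix rho(tau_h - tau_k)."""
    return float(np.linalg.eigvalsh(correlation_matrix(rho, taus)).min())

def exponential(tau):
    """Candidate exp(-|tau|), a permissible autocorrelation function."""
    return np.exp(-np.abs(tau))

def rectangular(tau):
    """Candidate equal to 1 for |tau| < 1 and 0 elsewhere."""
    return (np.abs(tau) < 1.0).astype(float)

taus = [0.0, 0.6, 1.2]                     # assumed sample points
print(min_eigenvalue(exponential, taus))   # positive
print(min_eigenvalue(rectangular, taus))   # negative, so not positive definite
```

A single set of points with a negative eigenvalue is enough to rule a candidate out; nonnegative eigenvalues for one set of points do not by themselves prove admissibility.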


PROBLEMS 


10-1. Consider the stochastic process

    \{X(t)\} = A \cos \omega t + B \sin \omega t

where
    \bar{A} = \bar{B} = 0
    \sigma_A = \sigma_B = \sigma
    \overline{AB} = 0


(a) Is this process stationary? 

(b) Study the autocorrelation function of the process when A and B are normally
distributed.

10-2. Prove that the following process is stationary if A_k and B_k are uncorrelated
random variables with zero means and standard deviation σ.

    \{X(t)\} = \sum_{k=1}^{n} (A_k \sin \omega_k t + B_k \cos \omega_k t)


10-3. Consider a stochastic process consisting of rectangular pulses. The height of 
the pulse is a random variable varying between 0 and 1 volt with uniform probability 
distribution in that interval. The widths of the pulses are all equal, and the heights of 
successive pulses are independent. 

(a) Find the autocorrelation function of the process. 

(b) Is the process stationary?

(c) Find the power spectrum. 

10-4. Is the process described in Prob. 10-1 ergodic? 

10-5. Find the correlation function of a stationary random process whose spectral
density is

    \phi_{xx}(\omega) = k = const





10-6. A stationary stochastic voltage process with a correlation function c is
applied to an RLC series network. Find the power spectrum of the current flowing
in this network.

10-7. Study the behavior of a general lossless one-port under the effect of a
stationary-process driving force whose autocorrelation function is
(Employ Foster's reactance theorem.)

10-8. Which one of the following functions is admissible as an autocovariance or
autocorrelation function of a second-order stationary process?

(b) Graph of Fig. P10-8.

10-9. The correlation function of a stationary process is given by

    \rho_{xx}(\tau) = A e^{-k|\tau|}

where A and k are appropriate positive constants. Find the spectral density of the 
process. 

10-10. A stochastic process is described by

    \{X(t)\} = A \cos (t + \alpha)

where A and α are statistically independent random variables. A is normally
distributed with zero mean and standard deviation equal to 1. α is distributed
between 0 and 2π with a uniform density of 1/2π.

(a) Find the first- and the second-order ensemble averages.

(b) Find the first- and the second-order time averages.

(c) Find the autocovariance function.

(d) Find the autocorrelation function.

(e) Is this process stationary?

(f) Is this process ergodic?

10-11. A noise process {X(t)} goes through a delay network; that is, the output
of the network at time t is

    \{X(t - \tau)\}

Consider a linear combination of the two processes:

    \{Y(t)\} = K_1 \{X(t)\} + K_2 \{X(t - \tau)\}

where K_1 and K_2 are specified constants. Find the autocorrelation function and the
power spectral density of {Y(t)} in terms of those parameters of {X(t)}.

Fig. P10-11




10-12. Find the cross-correlation function between the two processes given below:

    \{X(t)\} = A \sin \omega t + B \cos \omega t
    \{Y(t)\} = -A \cos \omega t + B \sin \omega t

10-13. Study the band-limited process

    \{X(t)\} = \sum_{k} A_k \frac{\sin \pi(2wt - k)}{\pi(2wt - k)}

where A_k's are normally distributed independent random variables with zero means
and equal variance.

10-14. Consider an AM signal carrier y(t) = a cos (ω_0 t + θ), where a(t) and
θ(t) are, respectively, the corresponding envelope and phase. Show that, if a(t) and
θ(t) are ergodic, the same is true for y(t) (see Middleton, Sec. 1.6).

Note. For additional problems consult Davenport and Root, Chaps. 6, 8, and 9;
Middleton, Chaps. 1, 2, and 3; and Laning and Battin.



CHAPTER 11 


COMMUNICATION UNDER STOCHASTIC REGIMES 


11-1. Stochastic Nature of Communication. In the study of prob- 
ability theory it was pointed out that, when the number of random varia- 
bles is not finite, we deal with stochastic processes. It was also pointed 
out that dealing with stochastic processes and their related problems 
requires some special techniques. The preceding chapter was concerned 
with an introductory presentation of the theory of stochastic processes.
This chapter provides a brief application of the subject to information 
theory. 

In dealing with physical sources of communications, such as teletype or
radio communication, we are generally confronted with time series. For
example, when a simple binary source transmits 0's and 1's, theoretically
we have a doubly infinite sequence of 0's and 1's which must be considered
as a member of an ensemble of such doubly infinite sequences,

    . . . , 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, . . .

    . . . , 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, . . .

If for convenience we assume that the letter X_k is transmitted at a specific
instant t_k, then X_k is a random variable assuming either one of the values
0 or 1. A random message of this source is written as

    . . . , X_{-2}, X_{-1}, X_0, X_1, X_2, . . .

Similarly for a teletype we have an analogous doubly infinite series where
each X_k can assume any of a teletype's characters. The characterization
of such information sources directly follows from the characterization of
their time series. For example, as discussed in the preceding chapter,
the joint probability distribution of, say, X_k and X_j should be obtained
from either an analytical or experimental description of the process.

In Chaps. 3, 4, 8, and 9, we confined ourselves to independent sources 
where successive symbols were selected independently from a finite 
alphabet. In the present chapter we relax this constraint and allow 
sources with some sort of interdependence among the transmission proba- 
bilities of the symbols. Therefore, it is natural to seek to establish an 
information theory dealing with such time series. More specifically, 


it is our task to associate a communication entropy with a stochastic
information source (with memory) and to study the transmission of
information in channels as before. Furthermore, we wish to establish
some theorems analogous to the fundamental theorems of transmission of
information in discrete and continuous channels.

N. Wiener was one of the first scientists who clearly described the
stochastic nature of communication problems. Wiener put in focus the
fact that the communication of information is primarily of a statistical
nature. That is, at a given time a message is drawn from a universe of
possible messages according to some probability law. At the next
moment another message from this universe will be transmitted, and so
on. The joint probability distribution of the transmitted and the
received messages contains all the mathematical information necessary
for the study of a communication channel. This joint probability
distribution is not generally known, although usually we know the input and
the conditional probabilities (noise structures). The development of an
information theory for stochastic sources and channels began with the
classical work of Shannon, who considered sources of the Markov type.
Considerable mathematical clarification was brought forth later in the
work of B. McMillan, who further defined stationary sources, their
entropy, and the fundamental theorems of information theory. Further
elaborate proofs and treatment are due to A. Feinstein and A. I.
Khinchin. These contributions have considerably clarified the subtle
concepts of stationary sources, entropy, and the fundamental theorems
initiated in Shannon's work.

A complete presentation of stochastic information theory is not 
advanced here. The reader with a professional interest will find the 
above references indispensable. The following is aimed at giving an 
introductory treatment of the subject at a level suitable for the present 
first course in the subject. For those who may not be able to afford the 
time to study this part of the theory, the following preview will be of 
interest. The main portion of this chapter deals with the justification by 
mathematical techniques of the validity of our intuitive concepts of 
information theory of discrete and continuous random variables when 
applied to sources and channels with memory. The contributors have
also shown under what hypotheses such generalizations are valid.

In the subsequent discussion, we shall first define and briefly describe
a simple class of stochastic processes of great interest in engineering
problems, the so-called Markov chains. We give examples of stationary
and ergodic Markov chains and will point out that in many communication
problems the sources are of these types. The natural next step is to
investigate the information-theory aspect of simple Markov chains.
Given a stationary or ergodic Markov source, its communication entropy
will be defined. Finally we shall present a brief description of the
performance of a communication channel driven by a discrete stationary
source.

11-2. Finite Markov Chains. Consider a Markovian stochastic
process with a finite number of states:

$$[S] = [S_1, S_2, \ldots, S_n] \tag{11-1}$$

The chain consists of a sequence of states from $[S]$ such as

$$\ldots, S_2, S_2, S_1, S_5, \ldots$$

where the probability of moving from any state $S_i$ to the immediately
succeeding state $S_j$ is prespecified by $p_{ij}$ as an element of the
so-called transition probability matrix $[P] = [p_{ij}]$.

$$[P] = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nn} \end{bmatrix} \tag{11-2}$$

Note that all elements of $[P]$ are nonnegative and that the sum of the
elements of each row is unity. If all the elements of $[P]$ do not depend
on time, the associated Markov chain is said to be stationary. We deal
only with stationary chains.

Let $[P^{(r)}]$ be the row probability matrix corresponding to the chain
reaching different states from all possible given initial states in $r$
steps; it was shown in Eq. (2-117) that

$$[P^{(r)}] = [P^{(0)}][P]^r \tag{11-3}$$

where $[P^{(0)}]$ is the row probability matrix of the initial states. When
$[P]^r$ for some value of the positive integer $r$ has only strictly positive
elements (no zeros) the chain is referred to as a regular chain. Of course,
if $[P]^r$ has no zero entry for $r = r_0$, then all powers of $[P]$ with
$r > r_0$ will not have any zero entry. Then clearly, starting with a nonzero
initial probability, from each state $S_k$ one can reach any state $S_j$ with
a nonzero probability.

A state $S_k$ is said to be an absorbing state if it would be impossible to
leave that state:

$$P\{S_k \mid S_k\} = 1$$

A Markov chain containing at least one absorbing state is referred to
as an absorbing Markov chain. Obviously, an absorbing Markov chain is
not a regular chain. A Markov chain is said to be ergodic if, after a
certain finite number of steps, it is possible to go from any state to any
other state with a nonzero probability. Thus, a regular chain is ergodic
but the converse is not necessarily true (see Prob. 11-4).





Example 11-1. Determine if the chain illustrated in Fig. E11-1 is ergodic.

[Fig. E11-1: state diagram of a three-state chain with states 1, 2, 3]

Solution. The transition probability matrix is

$$[P] = \begin{bmatrix} 0 & 0 & 1 \\ \tfrac12 & 0 & \tfrac12 \\ 0 & 1 & 0 \end{bmatrix}$$

One has to determine whether some power of this matrix has no zero entry.
A simple way of checking this would be to replace all the nonzero entries
by, say, the number 1 and call the new matrix $[X]$. Next derive $[X]^2$,
$[X]^3$, $[X]^4$, etc. These multiplications are simple, and all nonzero
entries in $[X]^k$ can also be replaced by 1, as the actual values of the
entries are not of concern. In the present example it can be seen that
$[X]^4$ still has a zero entry, but $[X]^5$, $[X]^6$, . . . , have no zero
entries. Therefore the chain is a regular Markov chain and is thus also
ergodic.
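
The checking procedure of Example 11-1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `first_positive_power` is hypothetical, the second row of the matrix is taken as (1/2, 0, 1/2), and the stopping rule (give up after $(n-1)^2 + 1$ multiplications, Wielandt's bound for primitive matrices) is an addition that does not appear in the text.

```python
def first_positive_power(P, max_power=None):
    """Return the smallest r for which [P]^r has no zero entry, else None.

    Only the zero/nonzero pattern of P is used, as in Example 11-1.
    """
    n = len(P)
    if max_power is None:
        # Assumption: Wielandt's bound. A regular n-state chain shows a
        # strictly positive power no later than r = (n - 1)^2 + 1.
        max_power = (n - 1) ** 2 + 1
    X = [[1 if p > 0 else 0 for p in row] for row in P]   # the matrix [X]
    Xr = [row[:] for row in X]                            # holds [X]^r
    for r in range(1, max_power + 1):
        if all(all(row) for row in Xr):
            return r
        # Boolean product [X]^r [X]; nonzero entries are reset to 1.
        Xr = [[1 if any(Xr[i][k] and X[k][j] for k in range(n)) else 0
               for j in range(n)] for i in range(n)]
    return None


P = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
print(first_positive_power(P))                        # 5, so the chain is regular
print(first_positive_power([[1.0, 0.0],
                            [0.25, 0.75]]))           # None: state 1 is absorbing
```

A return value of `None` means only that the chain is not regular; it may still be ergodic in the sense defined above.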

Example 11-2. Same question as in Example 11-1 for the chain illustrated in
Fig. E11-2.

[Fig. E11-2: state diagram of the chain]

Solution. The chain is nonergodic and also nonregular, as state 1 is an
absorbing state.

11-3. A Basic Theorem on Ergodic Markov Chains. In this section
we wish to show that for ergodic chains, in the long run, the probability
of reaching any state $S_j$ is independent of the initial state and
furthermore that for large values of $r$ the probability of reaching a
state $S_j$ in $r$ steps is independent of $r$. In other words, the
statistical properties of the chain are somewhat homogeneous. More
specifically:

Theorem. Let $[P]$ be the transition matrix of an ergodic Markov
chain; then

$$\lim_{r \to \infty} \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nn} \end{bmatrix}^{r} = \begin{bmatrix} t_1 & t_2 & \cdots & t_n \\ t_1 & t_2 & \cdots & t_n \\ \vdots & \vdots & & \vdots \\ t_1 & t_2 & \cdots & t_n \end{bmatrix} = T \tag{11-4}$$

where $[T]$ is a permissible transition probability matrix. In order to
prove this theorem, we first establish the following lemma:

Lemma. Let $P$ be an $n \times n$ transition probability matrix with no zero
entry and $V$ a column matrix with $n$ positive elements $v_k$. Denote the
largest and the smallest element of $V$ by $v_M$ and $v_m$, respectively, and
the smallest element of $P$ by $\alpha_m$; then the largest and the smallest
elements of $PV = U$, $u_M$ and $u_m$, respectively, satisfy the following
inequality:

$$u_M - u_m \le (1 - 2\alpha_m)(v_M - v_m) \tag{11-5}$$

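Proof. The proof follows by first finding a lower bound for $u_m$ and an
upper bound for $u_M$ by a direct inspection of the matrix product

$$\begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nn} \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}$$

It will be shown that*

$$u_m \ge \alpha_m v_M + (1 - \alpha_m) v_m \qquad u_M \le \alpha_m v_m + (1 - \alpha_m) v_M \tag{11-6}$$

To show this, let $u_k = u_m$ be the smallest element of the $U$ matrix.

$$u_k = u_m = p_{k1}v_1 + p_{k2}v_2 + \cdots + p_{km}v_m + \cdots + p_{kM}v_M + \cdots + p_{kn}v_n \tag{11-7}$$

On the right-hand side let us add and subtract $\alpha_m v_M$ and $\alpha_m v_m$:

$$u_m = \alpha_m v_M - \alpha_m v_m + p_{k1}v_1 + \cdots + (p_{km} + \alpha_m)v_m + \cdots + (p_{kM} - \alpha_m)v_M + \cdots + p_{kn}v_n \ge \alpha_m v_M - \alpha_m v_m + \Big(\sum_{i=1}^{n} p_{ki}\Big) v_m \tag{11-8}$$

but $\sum_{i=1}^{n} p_{ki} = 1$. Thus

$$u_m \ge \alpha_m v_M + (1 - \alpha_m) v_m$$

Similarly, if $u_k = u_M$ is the largest element of the $U$ matrix, one may write

$$u_M = p_{k1}v_1 + p_{k2}v_2 + \cdots + p_{km}v_m + \cdots + p_{kM}v_M + \cdots + p_{kn}v_n$$

$$= \alpha_m v_m - \alpha_m v_M + p_{k1}v_1 + \cdots + (p_{km} - \alpha_m)v_m + \cdots + (p_{kM} + \alpha_m)v_M + \cdots + p_{kn}v_n \le \alpha_m v_m - \alpha_m v_M + \Big(\sum_{i=1}^{n} p_{ki}\Big) v_M = \alpha_m v_m + (1 - \alpha_m) v_M \tag{11-9}$$

* The proof given here was set forth by José Perini during a course on information
theory. (See also Kemeny and Snell.)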





From these inequalities one concludes that

$$u_M - u_m \le (1 - 2\alpha_m)(v_M - v_m)$$
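
The inequality of the lemma is easy to check numerically. The sketch below is only an illustration: the function name `lemma_sides` and the sample matrix and column are made up for the purpose and do not come from the text.

```python
def lemma_sides(P, v):
    """Return (u_M - u_m, (1 - 2*alpha_m) * (v_M - v_m)) for U = P V."""
    u = [sum(p * x for p, x in zip(row, v)) for row in P]   # the column U = P V
    alpha_m = min(min(row) for row in P)                    # smallest element of P
    left = max(u) - min(u)
    right = (1 - 2 * alpha_m) * (max(v) - min(v))
    return left, right


# Illustrative transition matrix with no zero entry and a positive column V.
P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2]]
v = [3.0, 1.0, 2.0]

left, right = lemma_sides(P, v)
print(left <= right)   # True: the spread of U is smaller than the bound
```

Multiplying by $P$ can only shrink the spread between the largest and smallest element of a column, which is the mechanism the rest of the proof relies on.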

Having established this relation, it is now possible to show that as $k$
is increased the difference between the largest and the smallest element in,
say, the first column of $[P]^k$ will be reduced. In fact, for a positive
integer $k$ the matrix $[P]^k$ is a transition probability matrix. Since
$[P]^{k+1} = [P][P]^k$, the first column of $[P]^{k+1}$ is the product of
$[P]$ by the first column of $[P]^k$, and this product will obey the above
lemma. Let $v_M^{(k)}$ and $v_m^{(k)}$ denote the largest and the smallest
term in the first column of $[P]^k$. Then

$$u_M - u_m \le [1 - 2\alpha_m]\,[v_M^{(k)} - v_m^{(k)}] \tag{11-10}$$

But note that $u_M$ and $u_m$ are, respectively, the largest and the smallest
term in the first column of $[P]^{k+1}$. In other words,

$$v_M^{(k+1)} - v_m^{(k+1)} \le [1 - 2\alpha_m]\,[v_M^{(k)} - v_m^{(k)}] \tag{11-11}$$

The iteration of this method suggests that

$$v_M^{(k+2)} - v_m^{(k+2)} \le [1 - 2\alpha_m]\,[v_M^{(k+1)} - v_m^{(k+1)}] \le [1 - 2\alpha_m]^2\,[v_M^{(k)} - v_m^{(k)}] \tag{11-12}$$

But $\alpha_m$, being the smallest number in a transition probability matrix,
cannot be larger than $\tfrac12$. Therefore, $0 \le 1 - 2\alpha_m < 1$ and

$$[1 - 2\alpha_m]^{k+1} \le [1 - 2\alpha_m]^{k} \tag{11-13}$$

It becomes evident that the difference between the largest and the smallest
element of each column in $[P]^k$ becomes smaller as $k$ is increased. This
shows that there is a limiting column matrix with all elements equal to $t_1$
such that for large $k$ the product of $[P]^k$ and the first column of $[P]$
approaches the $n$-element column matrix with elements equal to $t_1$.
By a similar reasoning, we find that there exists an $n \times n$ matrix $T$ as
described by the above theorem. With some additional mathematical
computation, it can be shown that a regular Markov chain has a unique
probability matrix $T$. In fact, let $[t]$ be any row of $T$; then

$$\lim_{k \to \infty} [P]^k = T \tag{11-14}$$

Note that

$$\lim_{k \to \infty} [P]^k \cdot [P] = [T][P]$$

Hence

$$[T] = [T][P] \tag{11-15}$$

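Now let us compute the probability row matrix for a regular chain
reaching any state $k$ after a large number of steps $n$, having started with
an initial row probability matrix $[P^{(0)}]$:

$$[P^{(n)}] = [P^{(0)}][P]^n = [P^{(0)}][T] \tag{11-16}$$

$$[P^{(n)}] = [p_1^{0} \ \ p_2^{0} \ \cdots \ p_n^{0}] \begin{bmatrix} t_1 & t_2 & \cdots & t_n \\ t_1 & t_2 & \cdots & t_n \\ \vdots & \vdots & & \vdots \\ t_1 & t_2 & \cdots & t_n \end{bmatrix} = [t_1 \ \ t_2 \ \cdots \ t_n] \tag{11-17}$$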





Under such circumstances the probability of reaching any particular
state after a large number of steps will be the same for every initial
probability matrix; that is, it does not depend on the initial probability.
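
This independence from the starting distribution can be observed numerically. The following sketch raises an illustrative three-state regular matrix to a high power and applies it to two different initial row matrices; the helper names and the numbers are assumptions chosen for the demonstration.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(P, r):
    """[P]^r for a positive integer r, by repeated multiplication."""
    result = P
    for _ in range(r - 1):
        result = mat_mul(result, P)
    return result


def row_times_matrix(p0, M):
    """Row probability matrix p0 multiplied by the square matrix M."""
    return [sum(p0[i] * M[i][j] for i in range(len(p0)))
            for j in range(len(M))]


# Illustrative regular chain (assumed values).
P = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
P50 = mat_pow(P, 50)

from_state_1 = row_times_matrix([1.0, 0.0, 0.0], P50)
from_mixture = row_times_matrix([0.2, 0.3, 0.5], P50)
print([round(x, 6) for x in from_state_1])   # both results are close to
print([round(x, 6) for x in from_mixture])   # the same row [t1, t2, t3]
```

For this matrix both rows agree with (0.2, 0.4, 0.4) to within rounding, whichever initial row matrix is used.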

For a regular Markov chain, the average number of times of being in
any state approaches the corresponding probability entry in the $T$ matrix.
That is, the probability of occurrence of a state approaches the value
specified by $[t]$ irrespective of the initial probability. This fact is a
direct conclusion of the law of large numbers. Example 11-3 will exhibit the
correctness of this statement. (For proof see Kemeny and Snell, “Finite
Markov Chains,” p. 73.)

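In closing this section it is useful to point out that Eq. (11-15) suggests
a simple method for obtaining the elements of the $T$ matrix.

$$\begin{aligned} t_1 p_{11} + t_2 p_{21} + \cdots + t_n p_{n1} &= t_1 \\ t_1 p_{12} + t_2 p_{22} + \cdots + t_n p_{n2} &= t_2 \\ &\ \,\vdots \\ t_1 p_{1n} + t_2 p_{2n} + \cdots + t_n p_{nn} &= t_n \\ t_1 + t_2 + \cdots + t_n &= 1 \end{aligned} \tag{11-18}$$

The values of $t_1, t_2, \ldots, t_n$ can be readily computed from Eq. (11-18).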
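
As a rough illustration of how Eq. (11-18) can be solved in practice, the sketch below keeps the first $n - 1$ balance equations, appends the normalization $t_1 + \cdots + t_n = 1$, and solves the resulting system exactly with rational arithmetic. The function name `stationary_row` is hypothetical, and the code assumes a regular chain so that the system has a unique solution (otherwise a pivot may not be found).

```python
from fractions import Fraction


def stationary_row(P):
    """Solve t P = t with sum(t) = 1 by Gauss-Jordan elimination."""
    n = len(P)
    P = [[Fraction(x) for x in row] for row in P]
    # Equation j (j = 0 .. n-2): sum_i t_i p_ij - t_j = 0.
    A = [[P[i][j] - (1 if i == j else 0) for i in range(n)] + [Fraction(0)]
         for j in range(n - 1)]
    # Last equation: t_1 + t_2 + ... + t_n = 1.
    A.append([Fraction(1)] * n + [Fraction(1)])
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


half = Fraction(1, 2)
P = [[0, 0, 1],
     [half, 0, half],
     [0, 1, 0]]
print(stationary_row(P))   # [Fraction(1, 5), Fraction(2, 5), Fraction(2, 5)]
```

One of the $n$ balance equations is always redundant, because the rows of a transition matrix sum to unity; that is why it can be traded for the normalization condition.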

11-4. Entropy of a Simple Markov Chain. Consider a simple stationary
Markov chain with a finite number of states

$$[A_1, A_2, \ldots, A_n] \tag{11-19}$$

and the transition probability matrix

$$[P] = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nn} \end{bmatrix} \tag{11-20}$$

If the system is initially in state $A_i$, the probabilities corresponding to a
transition of one step to any other state form a set of complete and
exhaustive probability schemes. (Thus, the sum of the elements in each
row of the transition probability matrix is unity.)

$$[(A_1|A_i), (A_2|A_i), \ldots, (A_n|A_i)]^{*}$$

$$[p_{i1}, p_{i2}, \ldots, p_{in}] \tag{11-21}$$

* The symbols $(A_k|A_i)$ and $(A_k|A_i)^{(r)}$, respectively, will be used in
this section as a shorthand notation for the following chain of events:

$$(X_1 = A_k) \cap (X_0 = A_i) \qquad (X_r = A_k) \cap (X_0 = A_i)$$

For the purpose of computing entropies [Eq. (11-26)], all events with specified
$i$, $k$, and $r$ must be considered distinct. For example, $(A_3|A_1)^{(2)}$
may consist of $A_1A_1A_3$, $A_1A_2A_3$, $A_1A_3A_3$, $A_1A_4A_3$, etc. It is for
this reason that the notation $\{p_{ik}^{(r)}\}$ will be used to denote the set
of probabilities of every individual member of $(A_k|A_i)^{(r)}$.

This notation seems to be helpful in the initial discussion, in order to avoid more
complex formulation. However, since it is mathematically awkward, it will be
dropped after it has served its purpose.





With this scheme, we associate an entropy $H_i$ indicating the average
amount of uncertainty of the system for moving one step ahead when
starting with state $A_i$.

$$H_i = -\sum_{j=1}^{n} p_{ij} \log p_{ij} \tag{11-22}$$

If the probability of the system being initially in state $A_i$ is designated by
$p_i$, it is natural to calculate the average uncertainty of the chain for
moving one step ahead from any initial state, when the initial states
have specified probabilities, that is,

$$H(X) = H^{(1)} = \sum_{i=1}^{n} p_i H_i = -\sum_{i=1}^{n} \sum_{j=1}^{n} p_i p_{ij} \log p_{ij} \tag{11-23}$$

We may call $H(X)$, or more specifically $H^{(1)}$, the entropy of the chain with
specified initial probabilities for moving one step.
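
A minimal sketch of Eqs. (11-22) and (11-23) follows. Logarithms are taken to base 2 (entropy in bits); the function names and the two-state chain with its initial probabilities are illustrative assumptions.

```python
from math import log2


def row_entropy(row):
    """H_i of Eq. (11-22): entropy of one row of the transition matrix."""
    return -sum(p * log2(p) for p in row if p > 0)


def one_step_entropy(p0, P):
    """H^(1) of Eq. (11-23): the row entropies averaged over p0."""
    return sum(p_i * row_entropy(row) for p_i, row in zip(p0, P))


# Illustrative two-state chain and initial probabilities (assumed values).
P = [[1 / 3, 2 / 3],
     [1 / 2, 1 / 2]]
p0 = [1 / 4, 3 / 4]

print(row_entropy(P[1]))          # 1.0 bit for the row (1/2, 1/2)
print(one_step_entropy(p0, P))    # equals (1/4) log2(3) + 7/12
```

Rows that are nearly deterministic contribute little uncertainty, so a chain spending most of its time in such states has a small one-step entropy.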

More generally, the set of events of going from $A_i$ to any other state in
$r$ steps,

$$[(A_1|A_i)^{(r)}, (A_2|A_i)^{(r)}, \ldots, (A_n|A_i)^{(r)}] \tag{11-24}$$

constitutes a finite complete probability scheme. The entropy of this
finite scheme is

$$H_i^{(r)} = -\sum_{j=1}^{n} \{p_{ij}^{(r)}\} \log \{p_{ij}^{(r)}\} \tag{11-25}$$

where $\{p_{ij}^{(r)}\}$ stands for the probability of any one of the discrete
chains moving from the $i$th to the $j$th state in $r$ steps. Thus, the entropy
of the chain for moving $r$ steps ahead from the initial states when the initial
probabilities are specified is

$$H^{(r)} = \sum_{i=1}^{n} p_i H_i^{(r)} = -\sum_{i=1}^{n} \sum_{j=1}^{n} p_i \{p_{ij}^{(r)}\} \log \{p_{ij}^{(r)}\} \tag{11-26}$$

For example, in order to compute the entropy of the set of all meaningful
three-letter English words (assuming simple Markovian structure)
the following procedure would apply. Compute the entropy of all
permissible three-letter words out of the set aaa, aab, aac, . . . , aba, abb,
abc, . . . . Do the same for three-letter English words beginning with
b, c, etc. The application of Eq. (11-26) will lead to the average entropy
of three-letter English words. The entropy per letter is one-third of this
value. If very long messages are considered, the entropy per letter will
provide an estimate of the entropy of the English language.

In Sec. 11-5 it will be shown that $\lim_{r \to \infty} H^{(r)}/r$ exists for any
ergodic Markov source. The mathematical impact of this fact stems from the
statistically homogeneous structure of the source output. (This is in
a way analogous to saying that $[P]^n$ approaches a limit for $n \to \infty$ for
regular Markovian sources.)





A simple theorem to be given here is as follows:

Theorem. The different-order entropies of a regular Markov chain
with initial probability $[t]$ (as described above) are additive, that is,

$$H^{(a+b)} = H^{(a)} + H^{(b)} \qquad a, b \text{ arbitrary positive integers} \tag{11-27}$$

We present a simple proof suggested by Khinchin.

Let us start from the state $A_i$ and go to $A_k$ in $r + 1$ steps. We do this
by going first one step ahead and then $r$ steps. The entropy will be

$$H_i^{(r+1)} = H_i^{(1)} + \sum_{k=1}^{n} p_{ik} H_k^{(r)} \tag{11-28}$$

The associated average entropy for the system to move $r + 1$ steps is

$$H^{(r+1)} = \sum_{i=1}^{n} p_i H_i^{(r+1)} \tag{11-29}$$

where $p_i$ is the initial probability of the chain starting with the $i$th state.
Substituting Eq. (11-28) in Eq. (11-29) we find

$$H^{(r+1)} = \sum_{i=1}^{n} p_i H_i^{(1)} + \sum_{i=1}^{n} p_i \sum_{k=1}^{n} p_{ik} H_k^{(r)} = H^{(1)} + \sum_{k=1}^{n} (p_1 p_{1k} + p_2 p_{2k} + \cdots + p_n p_{nk}) H_k^{(r)} \tag{11-30}$$

Thus, due to Eq. (11-18), with $p_i = t_i$,

$$H^{(r+1)} = H^{(1)} + H^{(r)} \tag{11-31}$$

As an immediate application of this theorem we have

$$H^{(r)} = H^{(1)} + H^{(r-1)} = 2H^{(1)} + H^{(r-2)} = \cdots = rH(X)$$
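
The additivity can be verified by brute force on a small chain. The sketch below enumerates every $r$-step path, treats each path as a distinct event, and averages the resulting entropies over the initial probabilities. The function name `path_entropy` and the two-state chain are illustrative assumptions; the row $[t] = [3/7, 4/7]$ satisfies $[t][P] = [t]$ for this particular matrix.

```python
from itertools import product
from math import log2


def path_entropy(p0, P, r):
    """H^(r): entropy of all r-step paths, averaged over initial states."""
    n = len(P)
    total = 0.0
    for i in range(n):
        h_i = 0.0
        for path in product(range(n), repeat=r):
            prob, state = 1.0, i
            for nxt in path:            # probability of this particular path
                prob *= P[state][nxt]
                state = nxt
            if prob > 0:
                h_i -= prob * log2(prob)
        total += p0[i] * h_i
    return total


P = [[1 / 3, 2 / 3],
     [1 / 2, 1 / 2]]
t = [3 / 7, 4 / 7]          # stationary row of this chain
other = [1 / 4, 3 / 4]      # an arbitrary initial row, for contrast

h1 = path_entropy(t, P, 1)
print(abs(path_entropy(t, P, 3) - 3 * h1) < 1e-9)    # True: H(3) = 3 H(1)
gap = (path_entropy(other, P, 1) + path_entropy(other, P, 2)
       - path_entropy(other, P, 3))
print(abs(gap) > 1e-3)      # True: additivity fails for a nonstationary start
```

The enumeration grows as $n^r$, so this is a check for small cases only, not a practical way to compute $H^{(r)}$.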

The entropy of the Markov chain for moving $r$ steps ahead is equal to
$r$ times the entropy of the chain for moving one step ahead. Since
$H^{(1)} = H(X)$ is the basic entropy associated with the Markov scheme,
it is interesting to note that the entropy relation of such a chain for
moving $r$ steps ahead is similar to the entropy of an extension of a discrete
independent source without memory, as discussed in Chap. 4. Based on
the foregoing, it is possible to define the communication entropy of a
regular Markovian source starting with any arbitrary initial
probabilities as

$$H(X) = \lim_{n \to \infty} \frac{H^{(n)}}{n} \tag{11-32}$$

The same defining equation applies when the source is stationary but not
necessarily Markovian. It is of mathematical interest to establish the
existence of this source entropy for stationary sources with a finite
alphabet and a finite memory (finite intersymbol effect). The existence proof
is given in Feinstein (I, p. 85).

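Example 11-3. Consider the Markov state diagram of Fig. E11-3. If the initial
probability matrix is $[P^{(0)}] = [\tfrac14 \ \ \tfrac34]$, find

(a) $P_A^{(1)}$, the probability of reaching state A in one step.

(b) $P_B^{(1)}$, the probability of reaching state B in one step.

(c) $P_A^{(2)}$, the probability of reaching state A in two steps.

(d) $P_B^{(2)}$, the probability of reaching state B in two steps.

(e) $P_A^{(3)}$, the probability of reaching state A in three steps.

(f) $P_B^{(3)}$, the probability of reaching state B in three steps.

(g) The $t$ matrix.

(h) Compute the entropy $H^{(1)}$.

(i) Compute the entropy $H^{(2)}$.

(j) Compute the entropy $H^{(3)}$.

(k) Compare $H^{(1)} + H^{(2)}$ with $H^{(3)}$.

(l) Same question as in part (k) but with the initial probability matrix $[t]$.

[Fig. E11-3: Markov state diagram with states A and B]

Solution. The solutions for parts (a) to (f) can be obtained by matrix method or
by drawing a tree and computing probability measures, as discussed in Chap. 2.

(a) and (b)

$$[\tfrac14 \ \ \tfrac34] \begin{bmatrix} \tfrac13 & \tfrac23 \\ \tfrac12 & \tfrac12 \end{bmatrix} = [\tfrac{11}{24} \ \ \tfrac{13}{24}]$$

(c) and (d)

$$[\tfrac{11}{24} \ \ \tfrac{13}{24}] \begin{bmatrix} \tfrac13 & \tfrac23 \\ \tfrac12 & \tfrac12 \end{bmatrix} = [\tfrac{61}{144} \ \ \tfrac{83}{144}]$$

(e) and (f)

$$[\tfrac{61}{144} \ \ \tfrac{83}{144}] \begin{bmatrix} \tfrac13 & \tfrac23 \\ \tfrac12 & \tfrac12 \end{bmatrix} = [\tfrac{371}{864} \ \ \tfrac{493}{864}]$$

(g)

$$\tfrac13 t_1 + \tfrac12 t_2 = t_1 \qquad \tfrac23 t_1 + \tfrac12 t_2 = t_2 \qquad t_1 + t_2 = 1$$

The above equations yield

$$[t] = [\tfrac37 \ \ \tfrac47]$$

Note that the answers to the previous parts approach $t$ rather rapidly.

(h)

$$H_A^{(1)} = -\tfrac23 + \log 3 \qquad H_B^{(1)} = 1 \qquad H^{(1)} = \tfrac{7}{12} + \tfrac14 \log 3$$

(i)

$$H_A^{(2)} = \tfrac19 \log 9 + \tfrac29 \log \tfrac92 + \tfrac13 \log 3 + \tfrac13 \log 3 = \tfrac43 \log 3 - \tfrac29$$

$$H_B^{(2)} = \tfrac16 \log 6 + \tfrac13 \log 3 + \tfrac14 \log 4 + \tfrac14 \log 4 = \tfrac12 \log 3 + \tfrac76$$

$$H^{(2)} = \tfrac{17}{24} \log 3 + \tfrac{59}{72}$$

(j)

$$H_A^{(3)} = \tfrac{16}{9} \log 3 + \tfrac{1}{27} \qquad H_B^{(3)} = \tfrac{11}{12} \log 3 + \tfrac{53}{36}$$

$$H^{(3)} = \tfrac{163}{144} \log 3 + \tfrac{53}{48} + \tfrac{1}{108}$$

(k) $H^{(1)} + H^{(2)}$ differs very little from $H^{(3)}$.

(l) The initial probability matrix is taken to be $[t]$; thus $H^{(1)} + H^{(2)} = H^{(3)}$ holds
as an identity.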

11-5. Entropy of a Discrete Stationary Source. Shannon's original
work was confined to sources of the Markov type (Shannon [I], Secs. 2
and 4). The concept of more general sources and their characterization
are due to McMillan. In this section we discuss McMillan's
characterization of a discrete stationary source.

Let $A$ be a finite set of letters called the alphabet of the source,

$$[A] = [a_1, a_2, \ldots, a_N]$$

Consider a source transmitting one letter from $[A]$ at each instant $t_k$,
where $t_k$ is an element of a doubly infinite time sequence $t$:

$$[t] = [\ldots, t_{-2}, t_{-1}, t_0, t_1, t_2, \ldots]$$

A typical transmitted message is

$$\{X\} = \{\ldots, X_{-2}, X_{-1}, X_0, X_1, X_2, \ldots\}$$

where $X_k$ is a random variable assuming any one of the values $a \in A$ for
$k \in K$.

$$[K] = [\ldots, -2, -1, 0, 1, 2, \ldots]$$




Our first task is to study the probabilistic nature of this source from its
output. For this, we need to define what is referred to as a cylinder set
of events. Among all members of the ensemble $\{X\}$, consider those
sequences that have specified outputs at certain specified instants. More
precisely, let, for example,

$$x_{-1} = a_1 \qquad x_0 = a_2 \qquad x_2 = a_6 \qquad x_k = a_n$$

Then all the doubly infinite sequences satisfying these specifications form
a set $E$ which is called a cylinder set. Each one of these sequences is an
element of the cylinder.

$$E = [\ldots, a_1, a_2, x_1, a_6, \ldots, x_k = a_n, \ldots] \tag{11-33}$$

Now suppose that the output of the source is a stationary process; then
it will be homogeneous in time, that is, any cylinder set $E$ will be carried
onto a cylinder set with identical probabilities after a shift of, say, one
time unit.

$$TE = [\ldots, x_{-2}, x_{-1}, a_1, a_2, x_2, a_6, \ldots, x_k, x_{k+1} = a_n, \ldots]$$

In other words, messages of this type will most likely be among different
output sequences of the source. The shift of time axis obviously may be
in either time direction.

Furthermore, we assume that a probability measure is defined for the
space of the message ensemble and that it is such that the probability
measure of any cylinder $S$ is equal to that of the shifted cylinder $TS$.

$$P\{TS\} = P\{S\}$$

$$P\{TE\} = P\{T^{-1}E\} = P\{E\} \tag{11-34}$$

The next step is to define the source entropy, that is, the per-symbol
rate at which the source emits information. Consider all the sequences
of the type

$$E = [\ldots, x_k, x_{k+1}, x_{k+2}, \ldots, x_{k+n-1}, \ldots] \tag{11-35}$$

Now assign $n$ specific letters from the alphabet $[A]$ to the $n$ positions
$x_k, x_{k+1}, \ldots, x_{k+n-1}$, that is, the transmitter transmits one specific
letter at a specified instant. Each of the distinct sequences of this
type with a defined probability measure forms a cylinder $S$. Let $\mu(S)$
be the probability measure of a particular cylinder $S$; then the
communication entropy of the set of $N^n$ possible $n$-term sequences can be
defined in the usual manner.

$$H_n = -\sum_{S} \mu(S) \log \mu(S) \tag{11-36}$$

The stationarity hypothesis asserts that $H_n$ remains independent of the
initial moment $t_k$ but of course is dependent on $n$. The crucial point in
McMillan's definition of the entropy of a stationary source is the fact
that the quantity

$$H(X) = \lim_{n \to \infty} \frac{H_n}{n} \tag{11-37}$$

exists and most naturally represents the entropy of the source much in the
same way that Shannon defines the entropy of an independent and a
Markovian source. The following proof of the existence of the entropy
for stationary sources is Khinchin's simplified version of McMillan's more
general results.

Consider the following cylinder sets of messages:

$A_m$, the cylinder set with specified letters in $m$ specified positions

$A_n$, the cylinder set with specified letters in $n$ specified positions

$A_{m+n}$, the cylinder set with specified letters in the previously specified
positions

Evidently, the entropies of these cylinder-set families satisfy the
relations

$$H(A_{m+n}) = H(A_n) + H(A_m|A_n)$$

$$H(A_m|A_n) \le H(A_m) \tag{11-38}$$

Thus, using the notation of Eq. (11-36) in the above, we find

$$H_n \le H_{m+n} \le H_m + H_n \tag{11-39}$$

For the special cases of cylinders with $m = 1$ and $m = n$ we have,
respectively,

$$H_n \le H_{n+1}$$

$$H_{2n} \le 2H_n \tag{11-40}$$

The latter inequality is easily generalized to

$$H_{nk} \le nH_k \tag{11-41}$$

or, for $k = 1$, $H_n \le nH_1$,

$$\frac{H_n}{n} \le H_1 < +\infty \tag{11-42}$$

Equation (11-42) implies that an upper and a lower limit exist for $H_n/n$
by virtue of the Bolzano-Weierstrass theorem, since $H_n/n$ is bounded
above and below. Let

$$a = \liminf_{n \to \infty} \frac{H_n}{n} < +\infty \tag{11-43}$$

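Now to prove the convergence of the sequence $H_n/n$, it is sufficient
to show that $\limsup (H_n/n) = a$. We may choose, for any $\epsilon > 0$, an
integer $q$ such that

$$\frac{H_q}{q} < a + \epsilon \tag{11-44}$$

For any $n > q$, let the integer $k$ be determined by $(k - 1)q < n \le kq$.
Then, by Eqs. (11-40) and (11-41),

$$\frac{H_n}{n} \le \frac{H_{kq}}{n} \le \frac{kH_q}{(k - 1)q} < \frac{k}{k - 1}(a + \epsilon) \tag{11-45}$$

As $n \to \infty$ we also have $k \to \infty$, so that

$$\limsup_{n \to \infty} \frac{H_n}{n} \le a + \epsilon \tag{11-46}$$

Since $\epsilon$ is arbitrary, the upper limit cannot exceed the lower limit $a$.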


It follows that as $n$ approaches infinity the entropy per symbol of the
stationary source $X$ tends to $a$ or $H(X)$, which is the entropy per letter
at the source. Further restriction on the output of a stationary source is
required in order to define an ergodic source. Ergodic sources are
discussed, for example, in Khinchin (pp. 49-54). More general definitions of
entropy for sources which are not necessarily stationary have appeared in
the literature [Rozenblatt-Rot (I, II)].
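
The behavior of $H_n/n$ can be illustrated on a small stationary source. The sketch below uses a two-state Markov source started from its stationary probabilities, so that the block probabilities do not depend on the starting instant, and computes $H_n$ by enumerating all $n$-letter blocks. The function name `block_entropy` and the numerical values are illustrative assumptions.

```python
from itertools import product
from math import log2


def block_entropy(t, P, n):
    """H_n: entropy of all n-letter blocks of a stationary Markov source."""
    h = 0.0
    for block in product(range(len(P)), repeat=n):
        prob = t[block[0]]                       # probability of the first letter
        for a, b in zip(block, block[1:]):
            prob *= P[a][b]                      # one-step transitions
        if prob > 0:
            h -= prob * log2(prob)
    return h


P = [[1 / 3, 2 / 3],
     [1 / 2, 1 / 2]]
t = [3 / 7, 4 / 7]          # stationary probabilities of this chain

H = {n: block_entropy(t, P, n) for n in range(1, 9)}
for n in range(1, 9):
    print(n, round(H[n] / n, 5))    # H_n / n decreases toward the source entropy
```

The printed ratios never increase with $n$ and stay above the one-step entropy of the chain, in agreement with the bounds $H_n \le H_{n+1}$ and $H_{m+n} \le H_m + H_n$.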

Note on AEP. The foregoing extension of the theory was in context
based on the so-called asymptotic equipartition property (AEP). A
mathematical description of this important property, which was given by
McMillan, may be found in Khinchin (Chap. 2) and Feinstein (I, Chap.
6). The following heuristic and brief description is included here. For a
given finite source alphabet $[A]$ of $N$ symbols, consider the cylinder set $C$
with $n$ specified letters. Each sequence in $C$ may be regarded as one of
$N^n$ elementary events of a finite probability scheme. Every sequence of
$C$ is a cylinder of the infinite space $A^I$ of the source with a definite
probability measure $\mu(C)$. Of course, for stationary sources, this
probability depends on $n$ but not on the time. Next, in line with the material
of Sec. 7-8, one may define a random variable

$$Z_n = -\frac{1}{n} \log \mu(c)$$

For stationary sources, the expected value of $Z_n$ is

$$\overline{Z_n} = -\frac{1}{n} \sum_{c} \mu(c) \log \mu(c)$$

The right-hand side of this equation can be identified with the per-symbol
entropy $H_n/n$ of the described $n$ sequence. That is,

$$\overline{Z_n} = \frac{H_n}{n}$$

The AEP states that as $n \to \infty$ for stationary sources, $Z_n$ approaches a
definite limit (convergence in probability) called the source entropy.
That is, for any arbitrarily small $\epsilon > 0$ and $\delta > 0$, we can find a
sufficiently large $n$ such that

$$P\{|Z_n - H| > \epsilon\} < \delta$$

Stronger statements are possible in the case of ergodic sources. For a
formal proof, see references cited before.
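
The convergence in probability can be made concrete for the simplest stationary source, a memoryless binary source. That choice, the function name `deviation_probability`, and the numerical values are assumptions made for the illustration. For such a source $Z_n$ depends only on how many times each letter occurs, so $P\{|Z_n - H| > \epsilon\}$ can be summed exactly over the binomial distribution.

```python
from math import exp, lgamma, log, log2


def deviation_probability(n, p=0.2, eps=0.1):
    """P{|Z_n - H| > eps} for n letters of a memoryless binary source.

    p is the probability of one of the two letters; entropies are in bits.
    """
    H = -p * log2(p) - (1 - p) * log2(1 - p)
    total = 0.0
    for k in range(n + 1):      # k = occurrences of the letter with probability p
        z = -(k * log2(p) + (n - k) * log2(1 - p)) / n
        if abs(z - H) > eps:
            # Binomial probability of k occurrences, computed through
            # logarithms to avoid overflow for large n.
            log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                       + k * log(p) + (n - k) * log(1 - p))
            total += exp(log_pmf)
    return total


for n in (10, 100, 1000):
    print(n, deviation_probability(n))   # the probability shrinks as n grows
```

For any fixed $\epsilon$ the printed probability falls toward zero, so a suitable $n$ can be found for every $\delta$.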

11-6. Discrete Channels with Finite Memory. In the preceding section
we discussed a source with a finite memory emitting discrete signals
from a finite alphabet. For physical consideration, one desires to feed
the output of a source into a transmission medium which is called a
channel. The first step in this direction is to define a channel and its
behavior under a stationary regime.

A channel is a two-port with an input and output. The input to the
channel has an alphabet (a source alphabet if driven by the source) which
we assume to be finite. Similarly, we assume for convenience that the
output of the channel also uses sequences of letters drawn from a finite
alphabet. Let the input and output alphabets be $\{A\}$ and $\{B\}$,
respectively. In memoryless channels the noise structure is generally
specified by a conditional probability matrix

$$P\{b_j \mid a_k\} \qquad \text{for all } b_j \in B,\ a_k \in A$$

[Fig. 11-1. A two-port analog of a communication system.]

When the channel has no memory, the noise probability matrix is
independent of the life history of the channel. When the channel has a
finite memory, then the noise probability depends on the life history of
the transmitted sequences up to the finite memory time prior to the
emission of the signal. For example, for a Markovian channel the noise
matrix is of the form

$$P\{Y_k = b \mid \ldots, X_{-1}, X_0, X_1, \ldots, X_k\} = P\{Y_k = b \mid X_k\} \tag{11-47}$$

As our objective is to give a more general description of a channel, a
more general method for describing the noise is essential. For this
purpose, consider a member of an input ensemble $x$ and its corresponding
mate at the output $y$ (that is, if $x$ is transmitted, $y$ is received).

Input $[A]$: $\{X\} = \{\ldots, x_{-2}, x_{-1}, x_0, x_1, \ldots\}$

Output $[B]$: $\{Y\} = \{\ldots, y_{-2}, y_{-1}, y_0, y_1, \ldots\}$

Let $X^I$ and $Y^I$ be all possible source and received sequences, respectively.
In $X^I$, let us focus attention on a cylinder $S_A$ which has a specific letter,
say $a_1$, at a specific position, say $x_2$.

$$S_A = \ldots, x_0, x_1, a_1, x_3, x_4, x_5, \ldots$$

Similarly, for a moment, concentrate on a particular cylinder at the
output, say $S_B$, which has a specified letter $b_2$ at the position $y_1$.

$$S_B = \ldots, y_{-1}, y_0, b_2, y_2, \ldots$$

To know the noise characteristic we must know the conditional
probability of cylinder $S_B$ being received when $S_A$ is transmitted:

$$P\{S_B \mid S_A\}$$

More specifically, for all possible cylinders $S_A \subset X^I$ at the input, we must
have the conditional probability corresponding to any possible cylinder
for messages at the output $S_B \subset Y^I$. To sum up, the following
requirements are necessary in order to specify a general channel.

1. Input alphabet $[A]$

2. Output alphabet $[B]$

3. $P\{S_B \mid S_A\} = \nu_x$ for all $S_A \in X^I$ and $S_B \in Y^I$ (11-48)

Thus a discrete channel is specified by the set of triple data

$$[A, \nu_x, B] \tag{11-49}$$

If a channel is such that its noise structure remains invariant with respect
to a time shift, that is,

$$\nu_{Tx}(TS) = \nu_x(S) \tag{11-50}$$

($T$ being the shift operator), then the channel is said to be stationary.

11-7. Connection of the Source and the Discrete Channel with Memory.
In the terminology of information theory, a channel is driven by a
source in much the same way as a passive electric circuit is driven by an
electric source. The information source and the channel must have a
common alphabet in order to provide a meaningful coupling. When the
source transmits a letter $x_k \in A$, then at the output of the channel the
letter will be received as a letter $y_k \in B$. If a sequence of letters

$$\ldots, x_{-1}, x_0, x_1, x_2, \ldots$$

is transmitted, a sequence of letters

$$\ldots, y_{-1}, y_0, y_1, y_2, \ldots$$

will be received. The probability distribution of $Y_k$ in the latter sequence
obviously depends on the statistical properties of the input sequence, or
what could be referred to as the probability measure $\nu_x$. If the
probability distribution of $Y_k$ depends only on the statistical properties of
the sequence $\ldots, x_{k-1}, x_k$, we say that the channel is without anticipation.
This implies that the statistical information about the present state at
the receiver is specified by the past and the present states at the input.
If, furthermore, the distribution of $Y_k$ depends only on $x_{k-m}, \ldots, x_k$,
then we say that the channel has a finite memory of $m$ units.

The situation of connecting a source to a channel is quite similar to
many familiar deterministic setups. For example, when a passive
two-port network is driven by an ordinary electric source, one has the setup
of Fig. 11-1. Here we have the following basic specifications:

1. An ideal source, ideal in the sense that its characteristic does not
depend on the network to which it is to be connected

2. A passive network, that is, no output at 22 unless a source is
connected to 11

The performance of the source and channel is specified by deterministic
laws. Here, we know how to describe the output in terms of the
source and the network parameters. Our present problem presents the
information-theory analog of the afore-mentioned situation. The source
transmits at random messages $x \in A^I$. The correspondence between
the output and the input of the channel is a random one because of the
effect of noise, but the statistical description of this random effect is
governed by the distribution $\nu_x$. As in the case of channels without
memory, we consider the product space of the pair, input signal $x$ and
output signal $y$.

$$x \in A^I \qquad y \in B^I \qquad (x, y) \in C^I \tag{11-51}$$

The specification of a probability distribution on $C^I$ is, in fact, similar to
specifying the joint probability matrix associated with a product space in
the case of discrete random variables.

In a similar fashion one can describe a basic cylinder $E$ as the
product of a cylinder $E_1 \in A^I$ and $E_2 \in B^I$. It can be shown that we
can associate a probability measure with each such cylinder of $C^I$. A
more general event is split up into basic cylinders or limits of them in order
to find its probability. The mathematical development for defining the
probability measure of the product space is not given here. But one
may visualize that in essence the treatment is very similar to the discrete
case where a joint (product) probability is obtained as the product of the
marginal and the conditional probabilities.

$$P\{x \cap y\} = P\{x\} \cdot P\{y \mid x\}$$

To sum up, the source and the channel may be described as a new source
$[C, \omega]$, where $C$ is the product of the two alphabets, $A \times B$, and $\omega$ the
appropriate probability measure.* The product space $C^I$ acts as a source
in a product space similar to the space $\{X, Y\}$, and its probability measure
$\omega$ is analogous to the joint probability matrix defined in Chap. 2.

11-8. Connection of a Stationary Source to a Stationary Channel. A
most interesting result of the previous discussion is the study of the
stationary regime of source and channel. The following presentation is
due to Khinchin.

1. First of all, it can be shown that, if $[A, \mu]$ and $[A, \nu_x, B]$ are
stationary, the product source $[C, \omega]$ is also stationary. [For proof, see
Khinchin (pp. 80-82).]

2. Each stationary source has an entropy; therefore

$$[A, \mu], \quad [B, \eta], \quad [C, \omega]$$

each have definite entropies. $[B, \eta]$ is the equivalent output source of
the channel.

3. Let these entropies first be defined for all $n$-term sequences $x_0,
x_1, \ldots, x_{n-1}$ emitted by the source and transmitted through the
channel; more specifically,

$$\begin{aligned} H_n(X) &\leftarrow \{x_0, x_1, \ldots, x_{n-1}\} \\ H_n(Y) &\leftarrow \{y_0, y_1, \ldots, y_{n-1}\} \\ H_n(X,Y) &\leftarrow \{(x_0,y_0), (x_1,y_1), \ldots, (x_{n-1},y_{n-1})\} \\ H_n(X|Y) &\leftarrow \{(x_0|Y), (x_1|Y), \ldots, (x_{n-1}|Y)\} \\ H_n(Y|X) &\leftarrow \{(y_0|X), (y_1|X), \ldots, (y_{n-1}|X)\} \end{aligned} \tag{11-52}$$

Therefore

$$H_n(X,Y) = H_n(X) + H_n(Y|X)$$

$$H_n(X,Y) = H_n(Y) + H_n(X|Y) \tag{11-53}$$

* The probability measure for the joint event $x \in S_1$, $y \in S_2$ is defined as

$$\omega(S) = \omega(S_1 \times S_2) = \int_{S_1} \nu_x(S_2)\, d\mu(x)$$
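
The two decompositions of Eq. (11-53) can be checked numerically in the simplest case, $n = 1$ with a memoryless channel, where the joint probabilities are just the products $P\{x\}P\{y|x\}$. The sketch below is only an illustration under that assumption; the function names and the numerical values are made up.

```python
from math import log2


def entropy(probs):
    """Entropy in bits of a finite probability scheme."""
    return -sum(p * log2(p) for p in probs if p > 0)


def channel_entropies(px, pyx):
    """Return H(X), H(Y), H(X,Y), H(Y|X), H(X|Y) for input px and noise pyx."""
    nx, ny = len(px), len(pyx[0])
    joint = [[px[i] * pyx[i][j] for j in range(ny)] for i in range(nx)]
    py = [sum(joint[i][j] for i in range(nx)) for j in range(ny)]
    h_x = entropy(px)
    h_y = entropy(py)
    h_xy = entropy(p for row in joint for p in row)
    h_y_given_x = sum(px[i] * entropy(pyx[i]) for i in range(nx))
    h_x_given_y = sum(py[j] * entropy(joint[i][j] / py[j] for i in range(nx))
                      for j in range(ny) if py[j] > 0)
    return h_x, h_y, h_xy, h_y_given_x, h_x_given_y


px = [0.25, 0.75]                      # input letter probabilities (assumed)
pyx = [[0.9, 0.1],                     # noise matrix P{y | x} (assumed)
       [0.2, 0.8]]
h_x, h_y, h_xy, h_y_x, h_x_y = channel_entropies(px, pyx)
print(abs(h_xy - (h_x + h_y_x)) < 1e-12)   # True
print(abs(h_xy - (h_y + h_x_y)) < 1e-12)   # True
```

For channels with memory the same identities hold for the $n$-term sequences, with the single-letter probabilities replaced by the measures of the corresponding cylinders.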





11-4. Show that the chain illustrated in Fig. P11-4 is ergodic but not regular.



[? ;] 

(a) Is the chain regular?

(b) Is the chain ergodic?

(c) Find the T matrix.

11-6. Draw a Markov diagram for a game of tennis. Assume that a player has a
fixed probability P for winning each point. Determine:

(a) Probability of winning a game when P = 0.60.

(b) Probability of winning a set when P = 0.60.

(c) Parts (a) and (b) for P = 0.51.

J. L. Snell, Finite Markov Chains and Their Applications, Am. Math. Monthly, 
vol. 66, no. 2, pp. 99-104, February, 1959. 

11-7. (a) Find the T matrix for the probability transition matrix below.

Va 

0 H }4 

h h 0 

(b) Assuming an initial probability matrix of compute the entropies

and and compare with 
(c) Discuss and verify the situation when 

11-8. For Prob. 11-1 determine the entropy for each regular Markov chain.

11-9. Show that, for a regular Markov chain, there is a limiting probability $a_j$ of
being in the state $S_j$ independent of the starting state; $a_j$ is also the fraction of
times that the process can be expected in state $S_j$ (see Kemeny and Snell, Chap. 4).



PART 4 


SOME RECENT DEVELOPMENTS 


The importance of a fact is therefore measured by its yield, that is, by the
amount of thought it allows us to economize.

If a new result has value, it is when, by linking together elements long
known but until then scattered and seemingly foreign to one another, it
suddenly introduces order where the appearance of disorder reigned. It then
allows us to see at a glance each of these elements and the place it occupies
in the whole. This new fact is not only valuable in itself, but it alone gives
their value to all the old facts that it links together.

Henri Poincaré
Atti IV Congr. Intern. Mat., vol. I, p. 109




CHAPTER 12 


THE FUNDAMENTAL THEOREM OF 
INFORMATION THEORY 


Frequent reference has been made to Shannon’s significant statement
that by proper encoding it is possible to send information at a rate arbitrarily
close to C through the channel, with as small a probability of error
or equivocation as desired. In the context of our work, we have put in
focus the accurate meaning of this statement for discrete as well as continuous
channels. The major objective of this chapter is to give a
detailed statement and proof of the fundamental theorem for discrete
noisy memoryless channels. Unless otherwise specified, in this chapter
we are concerned only with discrete noisy memoryless channels.

A glance at the above statement reveals that there are several points
to be clarified before the full implication of the statement is realized.
These points are as follows:

1. A quantitative definition of the words “error” and “frequency” or
“probability of error” in a communication system

2. The relation between error and the equivocation entropy of the
channel

The definition of error (1) requires the description of a detection or a
decoding scheme. Once we have a method for decoding the received
signals, then we are in a position to discuss the probability of error associated
with that method. Thus, our immediate plan for the first part
of this chapter consists of the following:

1. The definition of a decision (or detection) scheme (Sec. 12-1)

2. The definition of the error associated with a detection scheme 
(Sec. 12-2) 

3. A discussion of the relation between error and equivocation (Sec. 
12-3) 

4. A study of the transmission of information in the extended channel 
(Sec. 12-4) 

Subsequent to the presentation of these preliminaries, we shall turn our 
attention to the fundamental theorem of discrete memoryless channels. 

From an organizational point of view, the chapter is divided into the
following four parts: Preliminaries, Feinstein’s Proof, Shannon’s Proof,
and Wolfowitz’s Proof.




Wolfowitz’s proof is the most recent of the three. Feinstein’s proof is
the first complete proof and uses a certain ingenious but complex procedure.
Shannon’s proof presents a good deal of physical insight into
the problem and contains new ideas which may lead to fresh fields.

In the following presentation, we have made an effort to use the nota- 
tion of the original contributors as far as possible. This conformity of 
notation will be found convenient for those readers who wish to study the 
original articles. 


PRELIMINARIES 

12-1. A Decision Scheme. A decision scheme is a method that a
receiver employs for determining, after a word has been received, which
particular message word was transmitted. Let us review briefly the
mathematical model of our discrete memoryless communication system.
Each transmitted and received word is an n-letter sequence like

$u_k = x_{k1}, x_{k2}, \ldots, x_{kn}$
$v_k = y_{k1}, y_{k2}, \ldots, y_{kn}$

where $x_{ki}$ and $y_{ki}$ are letters selected from a finite alphabet. At the
receiver, knowing the transmission alphabet and the channel matrix, we
wish to devise a detection scheme. That is, when a word $v_k$ is received
we must associate (rightfully or otherwise) one of the transmitted words
with $v_k$. This assignment calls for a decision that may be substantiated
by some statistical inference, although the possibility of a random
decision is not excluded. At the transmitter, we have an alphabet of $a$
letters at our disposal, out of which N words each composed of n letters
are selected for transmission. The receiving space is partitioned into
N disjoint sets $[A_1, A_2, \ldots, A_N]$ in a one-to-one correspondence with
$[u_1, u_2, \ldots, u_N]$. Whenever a word $v_k \in A_j$ is received, we decide
that the word $u_j$ was transmitted. This setup is referred to as a decision
scheme, although we have not stated the statistical criterion on which the
partitioning has been based. A decision scheme generally requires (1)
a given input probability distribution $P\{u\}$ and (2) a criterion for partitioning
the receiving space. The maximum-likelihood principle used
in Sec. 9-14 is one of the most common decision criteria; this will be
described in the next section.
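
The partitioning of the receiving space can be illustrated by a short Python sketch. It is only an illustration under assumptions that do not come from the text: a binary symmetric channel with crossover probability 0.1, words of n = 3 letters, and N = 2 transmitted words 000 and 111. The names P_FLIP, CODE, likelihood, and partition are hypothetical.

```python
from itertools import product

P_FLIP = 0.1                      # assumed BSC crossover probability
CODE = [(0, 0, 0), (1, 1, 1)]     # assumed transmitted words u_1, u_2


def likelihood(v, u, p=P_FLIP):
    """P{v | u} for a memoryless binary symmetric channel."""
    prob = 1.0
    for y, x in zip(v, u):
        prob *= p if y != x else 1.0 - p
    return prob


def partition(code, n):
    """Assign every received word v to the region A_j of the most likely u_j."""
    regions = {j: [] for j in range(len(code))}
    for v in product((0, 1), repeat=n):
        j = max(range(len(code)), key=lambda k: likelihood(v, code[k]))
        regions[j].append(v)
    return regions


REGIONS = partition(CODE, 3)      # the disjoint sets A_1, A_2
```

Here every one of the eight possible received words falls in exactly one region, so the regions are disjoint and cover the receiving space, as the definition requires.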

12-2. The Probability of Error in a Decision Scheme. In this section,
we wish to define clearly the error and the error probability
associated with a decision scheme. Error in detection occurs when a
word is received in $A_k$ although $u_k$ was not transmitted. The probability of
this error is

$e_k = P\{u \neq u_k | v \in A_k\} = 1 - P\{u = u_k | v \in A_k\}$    (12-1)




The error E associated with a decision scheme can conveniently be considered
as a random variable assuming values as stated in Eq. (12-1),
with a probability $P\{v \in A_k\}$. Thus the average value of the error is

$E = \sum_{k=1}^{N} e_k P\{v \in A_k\}$    (12-2)

The meaning of the above notation should be made clear. We find the
error for a possible $A_k$; then the index k is changed to cover all the detection
regions. It is instructive to note that the average error probability
can be equivalently derived by considering an alternative method of
bookkeeping. The error occurs when a word $u_k$ is transmitted but the
corresponding received word is not in $A_k$. This error probability can
also be considered as a value taken by a random variable E.

$P\{v \notin A_k | u = u_k\} = 1 - P\{v \in A_k | u = u_k\}$    (12-3)

The average value of E is

$E = \sum_{k=1}^{N} P\{u = u_k\}[1 - P\{v \in A_k | u = u_k\}]$    (12-4)


The equivalence of the expressions in Eqs. (12-2) and (12-4) can be
exhibited by appropriate manipulation of the terms. Equation (12-4)
can be written as

$E = \sum_{k=1}^{N} P\{u = u_k\}\left[1 - \frac{P\{v \in A_k \cap u = u_k\}}{P\{u = u_k\}}\right] = 1 - \sum_{k=1}^{N} P\{v \in A_k \cap u = u_k\}$

$= 1 - \sum_{k=1}^{N} P\{u = u_k | v \in A_k\} P\{v \in A_k\}$

$= \sum_{k=1}^{N} P\{v \in A_k\}[1 - P\{u = u_k | v \in A_k\}]$    (12-5)


To sum up, observe that we have arrived at a clear understanding of an
average error probability for any decision scheme in general. This error
probability, as anticipated earlier, is a function of the input probability
distribution, the channel, and the decision criterion. In particular, the
decision scheme which, upon reception of the symbol $v_k$, chooses that
transmitted symbol $u_j$ whose conditional probability $P\{u_j|v_k\}$ is the
greatest is generally referred to as an ideal observer. Note that this
definition is contingent upon the knowledge of the input or output
message probabilities.

$A_k = \{v_i : P\{u_k|v_i\} = \max_j P\{u_j|v_i\}\}$
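
The two methods of bookkeeping can be compared numerically. The following Python sketch is an illustration only: the joint probabilities in JOINT are invented for the example (they are not taken from the text), and the function names are hypothetical. It builds the ideal-observer regions and evaluates the average error once by regions, as in Eq. (12-2), and once by transmitted words, as in Eq. (12-4).

```python
# Hypothetical joint probabilities P{u_k, v_i}: rows are words u_k, columns are v_i.
JOINT = [
    [0.30, 0.05, 0.05, 0.00],
    [0.02, 0.28, 0.00, 0.10],
    [0.03, 0.02, 0.15, 0.00],
]


def ideal_observer(joint):
    """For each v_i, the index k maximizing P{u_k | v_i} (same as the largest joint term)."""
    cols = range(len(joint[0]))
    return [max(range(len(joint)), key=lambda k: joint[k][i]) for i in cols]


def error_by_regions(joint, assign):
    """Eq. (12-2): sum over k of P{v in A_k} [1 - P{u = u_k | v in A_k}]."""
    total = 0.0
    for k in range(len(joint)):
        cols = [i for i, a in enumerate(assign) if a == k]
        p_region = sum(joint[j][i] for j in range(len(joint)) for i in cols)
        if p_region > 0:
            p_correct = sum(joint[k][i] for i in cols) / p_region
            total += p_region * (1.0 - p_correct)
    return total


def error_by_words(joint, assign):
    """Eq. (12-4): sum over k of P{u = u_k} [1 - P{v in A_k | u = u_k}]."""
    total = 0.0
    for k, row in enumerate(joint):
        p_word = sum(row)
        p_hit = sum(row[i] for i, a in enumerate(assign) if a == k) / p_word
        total += p_word * (1.0 - p_hit)
    return total


ASSIGN = ideal_observer(JOINT)
E_REGIONS = error_by_regions(JOINT, ASSIGN)
E_WORDS = error_by_words(JOINT, ASSIGN)
```

For this assumed table both computations give the same average error, in agreement with the equivalence shown in Eq. (12-5).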

12-3. A Relation between Error Probability and Equivocation. Feinstein
has investigated the useful concept of uniform error bounds. A
decision scheme is said to be uniform error bounding with bound $\lambda$ if
there exists a number $0 < \lambda < 1$ such that

$P\{v \in A_k | u_k\} \ge 1 - \lambda$    k = 1, 2, . . . , N    (12-6)

Obviously, if a decision scheme has a uniform error bound, then Eqs.
(12-2) and (12-4) yield

$E \le \lambda$    (12-7)

That is, the average detection error probability will not exceed the uniform
error bound of the channel, provided that the latter exists. For
a discrete memoryless channel, the following theorem holds:

Theorem. Let E be the error probability of a decision scheme with N
detection regions. Then

$H(U|V) \le -E \log E - (1 - E) \log (1 - E) + E \log (N - 1)$    (12-8)

Proof. A proof of this theorem follows from the convexity property
of the entropy function. [See Feinstein (I, pp. 85-86).] The following
proof based on the law of additivity [Eq. (3-34)] is also of interest.
Consider a received word $v_k$ and the entropy $H(U|v_k)$. The original word
associated with $v_k$ will be denoted by $u_{i_k}$. Of course, several received
words may correspond to the same original word. Using Eq. (3-34)
and the convexity of x log x,

$H(U|v_k) = -P\{u_{i_k}|v_k\} \log P\{u_{i_k}|v_k\} - (1 - P\{u_{i_k}|v_k\}) \log (1 - P\{u_{i_k}|v_k\})$

$+ (1 - P\{u_{i_k}|v_k\}) \left( -\sum_{j=1,\, j \neq i_k}^{N} \frac{P\{u_j|v_k\}}{1 - P\{u_{i_k}|v_k\}} \log \frac{P\{u_j|v_k\}}{1 - P\{u_{i_k}|v_k\}} \right)$

$\le -P\{u_{i_k}|v_k\} \log P\{u_{i_k}|v_k\} - (1 - P\{u_{i_k}|v_k\}) \log (1 - P\{u_{i_k}|v_k\}) + (1 - P\{u_{i_k}|v_k\}) \log (N - 1)$    (12-9)

However, $H(U|V) = \sum_{k=1}^{M} P\{v_k\} H(U|v_k)$





where M is the number of all possible words in the receiving space. Thus,

$H(U|V) \le -\sum_{k=1}^{M} P\{v_k\} P\{u_{i_k}|v_k\} \log P\{u_{i_k}|v_k\}$

$- \sum_{k=1}^{M} P\{v_k\}(1 - P\{u_{i_k}|v_k\}) \log (1 - P\{u_{i_k}|v_k\})$

$+ \sum_{k=1}^{M} P\{v_k\}(1 - P\{u_{i_k}|v_k\}) \log (N - 1)$    (12-10a)

The summation over M possible received signals can be done conveniently
by first summing over a region $A_k$ and then proceeding over all such
regions.

$H(U|V) \le -\sum_{k=1}^{N} P\{v \in A_k\} P\{u_k|v \in A_k\} \log P\{u_k|v \in A_k\}$

$- \sum_{k=1}^{N} P\{v \in A_k\}(1 - P\{u_k|v \in A_k\}) \log (1 - P\{u_k|v \in A_k\})$

$+ \sum_{k=1}^{N} P\{v \in A_k\}(1 - P\{u_k|v \in A_k\}) \log (N - 1)$    (12-10b)

Using the values of $e_k$ and E as defined in Eqs. (12-1) and (12-2) and
the convexity argument, one finds

$H(U|V) \le -\sum_{k=1}^{N} P\{v \in A_k\}(1 - e_k) \log (1 - e_k) - \sum_{k=1}^{N} P\{v \in A_k\} e_k \log e_k + E \log (N - 1)$    (12-11)

or $H(U|V) \le -(1 - E) \log (1 - E) - E \log E + E \log (N - 1)$

When the channel is not lossless, at least for some input probability
distribution, the equivocation will be positive. That is, for a uniform
error bound scheme the equivocation entropy will be bounded by

$0 < H(U|V) \le -\lambda \log \lambda - (1 - \lambda) \log (1 - \lambda) + \lambda \log (N - 1)$    (12-12)

$0 < H(U|V) \le 1 + \lambda \log (N - 1)$

To obtain a more relevant bound on equivocation, consider the case of
the nth-order extension of a discrete memoryless channel with a capacity
C. According to Example 3-8, for a uniform input distribution we have

$H(U|V) = nH(X|Y) \le 1 + \lambda nC$    (12-13)

The equivocation per letter also remains bounded:

$H(X|Y) \le \frac{1}{n} + \lambda C$    (12-14)





Furthermore, the equivocation per letter can be made arbitrarily small
by choosing an appropriately small $\lambda$ and an adequately large n.
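
The bound of Eq. (12-8) can be checked numerically on a small example. The Python sketch below is an illustration under assumptions: the joint probabilities in JOINT are invented for the example, logarithms are taken to base 2, the decision scheme is the ideal observer, and all function names are hypothetical.

```python
from math import log2

# Hypothetical joint probabilities P{u_k, v_i}: rows are words u_k, columns are v_i.
JOINT = [
    [0.30, 0.05, 0.05, 0.00],
    [0.02, 0.28, 0.00, 0.10],
    [0.03, 0.02, 0.15, 0.00],
]


def entropy(probs):
    """Entropy in bits of a probability list; zero terms contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


def equivocation(joint):
    """H(U|V) = sum over v of P{v} H(U|v), in bits."""
    h = 0.0
    for i in range(len(joint[0])):
        p_v = sum(row[i] for row in joint)
        if p_v > 0:
            h += p_v * entropy([row[i] / p_v for row in joint])
    return h


def ideal_observer_error(joint):
    """Average error when each v is decoded as the word with the largest P{u|v}."""
    return 1.0 - sum(max(row[i] for row in joint) for i in range(len(joint[0])))


def error_bound(err, n_words):
    """Right-hand side of Eq. (12-8)."""
    return entropy([err, 1.0 - err]) + err * log2(n_words - 1)


E = ideal_observer_error(JOINT)
H_UV = equivocation(JOINT)
BOUND = error_bound(E, len(JOINT))
```

With N = 3 words the computed equivocation stays below the right-hand side of Eq. (12-8), which in turn stays below the cruder bound 1 + E log (N - 1).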

12-4. The Extension of Discrete Memoryless Noisy Channels. A
heuristic and a formal proof of the fundamental theorem for BSC were
presented in Chap. 4 when our knowledge of probability theory was very
rudimentary. From the material of Chaps. 5 to 7 we have acquired a
working knowledge of the principles of probability theory. In the light
of this development, we wish to present a more general proof. Our
task can be divided into the following steps:

1. Define an encoding-decoding scheme and compute the error associated
with a code.

2. Reconsider the transmission of information over a noisy memoryless
channel driven by a discrete memoryless source. Extend the results to
the nth-order extension of the source and the channel.

3. Using the laws of large numbers as described in Chap. 7, find a
relation between the number of messages to be transmitted, the error of
the detection scheme, and the rate of transmission.

4. Interrelate the parts discussed in 1, 2, and 3 to derive the fundamental
theorem of discrete noisy memoryless channels.

Consider a discrete independent source transmitting symbols from a
finite list $[a] = [1, 2, \ldots, a]$ with specified probabilities. The output of
this source is fed into a discrete noisy memoryless channel with specified
noise characteristics. Traditionally, we assume the input and the output
to the channel as consisting of random variables X and Y, respectively.
These random variables assume values from the list [a] with certain probabilities.
The schemes associated with these random variables will be
denoted by U and V, respectively.

The different probability functions of these schemes are

$P\{X = x\} = f_1(x)$
$P\{Y = y | X = x\} = f(y|x)$
$P\{X = x, Y = y\} = f(x,y)$
$P\{Y = y\} = f_2(y)$    (12-15)

where x and y both assume values from the list [a].

The rate of transinformation per symbol for the channel is

$I(X;Y) = \sum_U \sum_V f(x,y) \log \frac{f(x,y)}{f_1(x) f_2(y)}$    (12-16)

Next we wish to evaluate the transinformation rate for the nth-order
extension of the above channel when the output of the source consists of
n sequences, that is, sequences of length n with elements from [a]. Let
$X^n$ and $Y^n$ stand for the input and the output, respectively, of the nth-order
extension of the original channel, that is,

$X^n = X_1, X_2, \ldots, X_n$
$Y^n = Y_1, Y_2, \ldots, Y_n$    (12-17)


To conform with our previous notation, we let $U^n$ and $V^n$ be the probability
schemes associated with the random variables $X^n$ and $Y^n$, respectively.
Thus each typical sequence u of the form $x_1, x_2, \ldots, x_n$ is an
n sequence with all symbols taken from [a], that is, a value of $X^n$. Similarly,
for the output sequences of the channel we use v as a specific value
of the random variable $Y^n$. The probabilities and the information
measures associated with the schemes $U^n$, $V^n$, and $U^n \otimes V^n$ are

$P\{X^n = u\} = p_1(u)$
$P\{Y^n = v | X^n = u\} = p(v|u)$
$P\{X^n = u | Y^n = v\} = p(u|v)$
$P\{Y^n = v, X^n = u\} = p(u,v)$
$P\{Y^n = v\} = p_2(v)$

$I(X^n;Y^n) = \sum_{U^n} \sum_{V^n} p(u,v) \log \frac{p(u,v)}{p_1(u) p_2(v)}$    (12-18)
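
The transinformation of the extended channel can be evaluated by brute force for small n. The Python sketch below is an illustration under assumptions: the source probabilities F1 and the channel matrix CHANNEL are invented for the example, logarithms are taken to base 2, and the function names are hypothetical. It builds the nth-order extension for independently selected letters and a memoryless channel, and evaluates the double sum of the form used in Eq. (12-18).

```python
from itertools import product
from math import log2, prod

F1 = [0.7, 0.3]                       # assumed source probabilities f1(x)
CHANNEL = [[0.9, 0.1], [0.2, 0.8]]    # assumed channel matrix f(y|x)


def mutual_information(p_in, channel):
    """Sum of p(u,v) log2[p(u,v) / (p1(u) p2(v))] over all input/output pairs."""
    p_out = [sum(p_in[u] * channel[u][v] for u in range(len(p_in)))
             for v in range(len(channel[0]))]
    total = 0.0
    for u, pu in enumerate(p_in):
        for v, pv in enumerate(p_out):
            joint = pu * channel[u][v]
            if joint > 0:
                total += joint * log2(joint / (pu * pv))
    return total


def extend(p_in, channel, n):
    """nth-order extension: word probabilities are products over the n letters."""
    words_in = list(product(range(len(p_in)), repeat=n))
    words_out = list(product(range(len(channel[0])), repeat=n))
    p_ext = [prod(p_in[x] for x in u) for u in words_in]
    ch_ext = [[prod(channel[x][y] for x, y in zip(u, v)) for v in words_out]
              for u in words_in]
    return p_ext, ch_ext


I_ONE = mutual_information(F1, CHANNEL)
P3, CH3 = extend(F1, CHANNEL, 3)
I_THREE = mutual_information(P3, CH3)
```

For these assumed values the transinformation of the third-order extension comes out as three times the per-symbol value, as expected for a memoryless channel driven by independent letters.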


FEINSTEIN’S PROOF 

12-5. On Certain Random Variables Associated with a Communication
System.* Now that the necessary preliminaries have been covered,
we proceed with Feinstein by introducing certain random variables
associated with the source and the channel. The desired results are some
relevant inequalities relating the probability distributions of these variables
with different entropies. We assume:

1. Successive symbols transmitted by the extended source are selected 
independently. 

2. The channel has no memory. 

That is, we have

(I) $P\{X^n = u\} = P\{X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n\}$
$= f_1(x_1) f_1(x_2) \cdots f_1(x_n)$

(II) $P\{Y^n = v | X^n = u\} = P\{Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n | X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n\}$
$= f(y_1|x_1) f(y_2|x_2) \cdots f(y_n|x_n)$    (12-19)

* The author wishes to thank Dr. A. Feinstein for helpful comments on the material 
of this chapter. 




As a consequence of I and II we have 

(III) $P\{X^n = u | Y^n = v\} = f(x_1|y_1) f(x_2|y_2) \cdots f(x_n|y_n)$

As before, let us denote $P\{X^n = u\}$ by $p_1(u)$ and $P\{X^n = u | Y^n = v\}$
by $p(u|v)$ and let H(U) and H(U|V) refer to the original scheme.*

Some important relationships can be established between the number 
n, the source entropy, the channel equivocation, and the rate of trans- 
information per symbol. 

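Lemma 1. Let $1 > \epsilon > 0$, $1 > \delta > 0$ be two arbitrarily chosen numbers.
There exist two numbers $n_1(\epsilon,\delta)$ and $n_2(\epsilon,\delta)$ such that

$P\{p_1(X^n) > 2^{-n(H(U) - \epsilon)}\} < \delta$    $n > n_1$
$P\{p(X^n|Y^n) < 2^{-n(H(U|V) + \epsilon)}\} < \delta$    $n > n_2$    (12-20)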

Proof. Consider the random variables $p_1(X^n)$ and $p(X^n|Y^n)$. As a
consequence of I and III we have

$\log p_1(X^n) = \sum_{i=1}^{n} \log f_1(X_i)$

$\log p(X^n|Y^n) = \sum_{i=1}^{n} \log f(X_i|Y_i)$    (12-21)


Because of I and II, each of these random variables consists of a sum of
independent random variables with identical distributions; thus the weak
law of large numbers applies. According to Sec. 7-8, for $n \to \infty$ the probability
of the mean of the random variables deviating from their common
mean by, at most, $\epsilon$ approaches unity. That is,

$P\left\{-H(U) - \epsilon < \frac{1}{n} \sum_{i=1}^{n} \log f_1(X_i) < -H(U) + \epsilon\right\} \to 1$

and    (12-22)

$P\left\{-H(U|V) - \epsilon < \frac{1}{n} \sum_{i=1}^{n} \log f(X_i|Y_i) < -H(U|V) + \epsilon\right\} \to 1$

Evidently, by properly choosing n, the probability of the mean of the
above random variables deviating by more than $\epsilon$ from their common
mean can be made as small as desired. More specifically, for any $\epsilon > 0$
there are numbers $n_1$ and $n_2$ such that

$P\left\{\left|H(U) + \frac{1}{n} \log p_1(X^n)\right| > \epsilon\right\} < \delta$    for $n > n_1$

$P\left\{\left|H(U|V) + \frac{1}{n} \log p(X^n|Y^n)\right| > \epsilon\right\} < \delta$    for $n > n_2$    (12-23)

* The reader should note that $p_1(u)$ and $f_1(x_1)$ are generally different functions, as
are $p(u|v)$ and $f(x_1|y_1)$, and $p(v|u)$ and $f(y_1|x_1)$.





Therefore

$P\left\{\frac{1}{n} \log p_1(X^n) > -H(U) + \epsilon\right\} < \delta$    for $n > n_1$    (12-24)

or $P\{p_1(X^n) < 2^{-n(H(U) - \epsilon)}\} > 1 - \delta$    for $n > n_1$

$P\{p(X^n|Y^n) > 2^{-n(H(U|V) + \epsilon)}\} > 1 - \delta$    for $n > n_2$

[The reader should keep in mind that in the above inequalities $p_1(X^n)$
and $p(X^n|Y^n)$ are considered random variables.]
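
The concentration expressed by the first inequality of Eq. (12-23) can be computed exactly for a simple source. The Python sketch below is an illustration under assumptions: a binary source emitting independent letters with P{1} = p, logarithms to base 2, and a hypothetical function name. Since the probability of a word depends only on its number of ones, the probability of the deviation event follows from the binomial distribution.

```python
from math import comb, log2


def atypical_probability(p, n, eps):
    """P{ |H(U) + (1/n) log2 p1(X^n)| > eps } for a binary source with P{1} = p."""
    h = -(p * log2(p) + (1 - p) * log2(1 - p))
    total = 0.0
    for k in range(n + 1):                      # k = number of ones in the word
        log_p = k * log2(p) + (n - k) * log2(1 - p)
        if abs(h + log_p / n) > eps:
            total += comb(n, k) * p**k * (1 - p)**(n - k)
    return total


SHORT = atypical_probability(0.2, 50, 0.1)
LONG = atypical_probability(0.2, 800, 0.1)
```

For the assumed values p = 0.2 and a deviation of 0.1, the probability of the deviation event is much smaller at n = 800 than at n = 50, which is the behavior the lemma asserts for sufficiently large n.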

12-6. Feinstein’s Lemma

1. Let Z be a subset of the product space $U^n \otimes V^n$ satisfying the
inequality

$P\{Z\} > 1 - \delta_1$    $\delta_1$ small positive number    (12-26)

2. Let $U_0 \subset U^n$ such that

$P\{U_0\} > 1 - \delta_2$    $\delta_2 > 0$

Then there exists a subset $U_1$ of $U_0$ such that for each $u_i \in U_1$ there is a
set $A_i$ of $V^n$ such that

$P\{A_i|u_i\} > 1 - \alpha$    for each $u_i \in U_1$

and $P\{U_1\} > 1 - \delta_2 - \frac{\delta_1}{\alpha}$    (12-27)

Proof. We begin by recalling that for any $u_i$

$A_i = \{v : (u_i, v) \in Z\}$

Next we define $U_1$ as the set of all those $u_i$ satisfying the following two
properties:

$u_i \in U_0$
$P\{A_i|u_i\} > 1 - \alpha$    (12-28)

So far we have defined $U_1$ and the $A_i$ as described by the lemma. It
remains to show that $P\{U_1\} > 1 - \delta_2 - \delta_1/\alpha$. To this end, we may
introduce an auxiliary set $U_2$ whose elements $u_i$ satisfy

$P\{A_i|u_i\} \le 1 - \alpha$    (12-29)

Thus, with $A_i'$ denoting the complement of $A_i$,

$P\{v_k \in A_i', u_i\} = P\{v_k \in A_i'|u_i\} P\{u_i\} \ge \alpha P\{u_i\}$    for $u_i \in U_2$    (12-30)

We find

$\sum_{u_k \in U_2} P\{A_k' \cap u_k\} \ge \alpha P\{U_2\}$    (12-31)




Note that the pairs $(u_k, v_k \in A_k')$ are not in Z. That is, their total probability
is less than $\delta_1$. Thus

$\alpha P\{U_2\} < \delta_1$    (12-32)

It remains to see how $U_2$, $U_1$, and $U_0$ are interrelated. For this, we
note that

$U_1 = U_0 - (U_0 \cap U_2)$    (12-33)

$P\{U_0\} = P\{U_1\} + P\{U_0 \cap U_2\}$

In addition,

$P\{U_0 \cap U_2\} \le P\{U_2\} < \frac{\delta_1}{\alpha}$

Therefore

$P\{U_1\} > 1 - \delta_2 - \frac{\delta_1}{\alpha}$

12-7. Completion of the Proof. Now we are in a position to apply
Feinstein’s lemma to sets Z and $U_0$ as described in lemma 1. Let Z
be the set of all (u,v) satisfying the second inequality of Eq. (12-20); we
know from lemma 1 that, for any $\delta_1 > 0$ for n sufficiently large,
$P\{Z\} > 1 - \delta_1$. Furthermore, let $U_0$ be the set of words u satisfying the first
inequality of Eq. (12-20). In view of lemma 1 for any $\delta_2 > 0$, we have

$P\{U_0\} > 1 - \delta_2$    for n sufficiently large

Since Z and $U_0$ satisfy the hypothesis of Feinstein’s basic lemma, we can
assert the existence of a set $U_1 = \{u_1, \ldots, u_N\}$ and the existence of the
corresponding $A_i$ such that

$U_1 \subset U_0$    (12-34a)

$\{(u_k,v) : v \in A_k\} \subset Z$    $u_k \in U_1$    (12-34b)

$P\{A_k|u_k\} > 1 - \alpha$    (12-34c)

$P\{U_1\} > 1 - \delta_2 - \frac{\delta_1}{\alpha}$    (12-34d)

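From (12-34a) and (12-34b) we see that for all (u,v) with $u = u_k$ and $v \in A_k$,

$\frac{p(u|v)}{p_1(u)} > 2^{n[H(U) - H(U|V) - \epsilon_1 - \epsilon_2]}$    k = 1, 2, . . . , N    (12-35)

where $\epsilon_1$ and $\epsilon_2$ are the values of $\epsilon$ used in the first and second
inequalities of Eq. (12-20), respectively. This implies that for any such (u,v),
k = 1, 2, . . . , N,

$\frac{p(u,v)}{p_1(u)} > 2^{n[H(U) - H(U|V) - \epsilon_1 - \epsilon_2]} p_2(v)$

and so for $u = u_k$,

$1 \ge \sum_{v \in A_k} \frac{p(u_k,v)}{p_1(u_k)} > 2^{n[H(U) - H(U|V) - \epsilon_1 - \epsilon_2]} \sum_{v \in A_k} p_2(v)$    (12-36)

This of course implies

$P\{A_k\} < 2^{-n[H(U) - H(U|V) - \epsilon_1 - \epsilon_2]}$    (12-37)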




From these results and other remarks we get the following central
theorem of information theory.

Theorem. With the previous notation, for any $0 < H < C$, $\epsilon > 0$,
$\delta > 0$, there exists an $n > n(\epsilon,\delta)$ such that for the nth-order extension of
the channel we can devise a detection scheme with bounded error; that
is, we can select a set of N messages, say $M[\epsilon] = \{u_1, u_2, \ldots, u_N\}$, such
that the receiving space could be partitioned into N disjoint sets $B_i$ with
the following characteristics:

1. To each $u_i \in M[\epsilon] = M$ there corresponds a $B_i$.

2. $B_i B_k = \emptyset$    $i \neq k$    i, k = 1, 2, . . . , N.

3. P\B:\n^ > 1 - c i = 1, 2, . . . , iV 0 < c < 

4. < L = 

5. There does not exist a set of N + 1 messages satisfying conditions
1, 2, and 3.

6. $N > 2^{nH}$.

Proof. In several instances (see, for example, Secs. 4-16, 4-11, and
9-15) when using the concept of minimum-distance detection, we have
demonstrated the possibility of devising detection schemes with disjoint
$B_i$’s and bounded error. From the foregoing material we already know
that a maximal* detection scheme satisfying 1, 2, and 3 can be devised.
However, no indication of the size of N has yet been given. We have no
idea how many n-sequence words could be transmitted without violating
the constraints 4, 3, and 6. To this end, consider the set $U_1$ of input
words $u_1', u_2', \ldots, u_{N'}'$ disjoint from $M[\epsilon]$ as an enlarging set $M[\alpha,L]$ with
$\alpha < \epsilon$, $P\{A_k\} < L$, and not necessarily disjoint $A_k$’s. We
exploit the fact that an element of $M[\alpha,L]$ cannot be used for increasing
the number of words N in $M[\epsilon]$. Of course, $A_k$ must intersect one or
more of the $B_i$’s; otherwise it could have been used for enlarging M (M is
used as abbreviation for $M[\epsilon]$). The following information is available
concerning the set $C_k$ of elements of $A_k$ not in the $B_i$’s.

$C_k = A_k - A_k \cap \left(\bigcup_{i=1}^{N} B_i\right)$    (12-38)


* By a maximal set is meant the realization of an N satisfying 1, 2, 3, and 6. A
maximal set which in addition satisfies $P\{B_i\} \le L$, i = 1, 2, . . . , N, is referred to as
a maximal L-bounded set with respect to a given input probability distribution. A
set satisfying 1, 3, and $P\{B_i\} \le L$, i = 1, 2, . . . , N, is referred to as an enlarging
set with respect to a given input distribution. The concepts maximal sets, bounded
sets, and enlarging sets were introduced by Feinstein. Our original plan was to apply
Feinstein’s lemma directly to BSC and to derive the fundamental theorem. Instead
we have given here our version of Feinstein’s proof, deviating very little from his notation
but forgoing mathematical refinements (such as the existence and number of
maximal sets, etc.). These finer points are generally evident from a physical point of
view. For a more comprehensive coverage see Feinstein (I), Chap. 4, and Khinchin,
Chaps. 1 and 4.




$P\{C_k\} \le P\{A_k\} < L$    (12-39)

$P\{C_k|u_k'\} < 1 - \epsilon$    $u_k' \in U_1$    (12-40)


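The following inequality will provide the final step required for the proof
of the theorem:

$P\left\{\bigcup_{i=1}^{N} B_i\right\} = \sum_{u} p_1(u) P\left\{\bigcup_{i=1}^{N} B_i \Big| u\right\}$

$\ge \sum_{u \in U_1 - M \cap U_1} p_1(u) P\left\{\bigcup_{i=1}^{N} B_i \Big| u\right\} + \sum_{u \in M \cap U_1} p_1(u) P\left\{\bigcup_{i=1}^{N} B_i \Big| u\right\}$

$\ge (\epsilon - \alpha)(P\{U_1\} - P\{M \cap U_1\}) + (1 - \epsilon) P\{M \cap U_1\}$

$\ge (\epsilon - \alpha)\left(1 - \delta_2 - \frac{\delta_1}{\alpha}\right)$    since $1 > \epsilon > \alpha$    (12-41)

Thus from condition 4 and from the fact that for the elements of the first
set

$P\left\{\bigcup_{i=1}^{N} B_i \Big| u\right\} \ge P\left\{A_k \cap \bigcup_{i=1}^{N} B_i \Big| u\right\} > \epsilon - \alpha$

we see that

$NL > (\epsilon - \alpha)\left(1 - \delta_2 - \frac{\delta_1}{\alpha} - P\{M \cap U_1\}\right) + (1 - \epsilon) P\{M \cap U_1\}$    (12-42)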

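Given $\epsilon$ and $H < C$, we now consider the input probability distribution
$p_1(u)$ which maximizes R. Furthermore, let

$C = \max R$
$\epsilon_1 + \epsilon_2 = \frac{C - H}{2}$    (12-43)

$1 - \delta_2 - \frac{\delta_1}{\alpha} > \frac{1}{2}$    n sufficiently large

$N > \min (\epsilon - \alpha, 1 - \epsilon)\left(1 - \delta_2 - \frac{\delta_1}{\alpha}\right) 2^{n(C - \epsilon_1 - \epsilon_2)}$

$N > \frac{1}{2} \min (\epsilon - \alpha, 1 - \epsilon)\, 2^{nH + n(C - H)/2}$    (12-44)

For sufficiently large n we have

$N > 2^{nH}$    (12-45)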





SHANNON’S PROOF 

12-8. Ensemble Codes. Subsequent to Feinstein’s formal proof of
the fundamental theorem of encoding for discrete channels in the presence
of noise, Shannon provided an alternative proof which does not
employ Feinstein’s inequalities. Shannon’s proof is based on the concept
of ensemble codes. He considers a class of distinct codes selected at
random and evaluates the average error probability not for any specific
member of this ensemble but for their totality. This average probability
of error may be bounded between some limiting values which are
not too difficult to derive. Thus, since there is at least one member of
the ensemble of codes which has a probability of error no greater than the
average error for the ensemble, there is a code whose error probability is less
than the upper bound. In other words, if we were given an encoding
machine, which selects any one of, say, k specific encoding procedures,
then, on an average, the machine cannot do better or worse than certain
limiting behaviors. The limiting average error values for the code
ensemble, of course, depend on the over-all nature of the coding-decoding
scheme. Shannon’s method has the merit of quantitatively exhibiting
the behavior of the probability of error as a function of word length,
as will be discussed later.

In order to appreciate fully Shannon’s technique, we shall devote this
section to the formulation of the problem. Consider a set of M messages
which are to be encoded and transmitted via a discrete memoryless
channel with a finite alphabet in the presence of noise. The messages
and the encoding alphabet are

$[m] = [m_1, m_2, \ldots, m_M]$
$[a] = [a_1, a_2, \ldots, a_D]$

The channel is defined as usual by its stochastic matrix:

$[P\{a_j|a_k\}]$    k, j = 1, 2, . . . , D    (12-46)

As before, in order to combat noise, we use code words n symbols long,
selected from alphabet [a]. We know that the per-symbol rate R of the
transmission of information is closely related to M, the number of messages,
and n, the word length. For the moment, assume that we have a
transinformation rate in mind that is not beyond the realm of possibility.
Then the word lengths are selected according to the equation

$R = \frac{1}{n} \ln M$

$M = e^{nR}$    (12-47)





Assume that the transmitter’s vocabulary has B distinct words. Our
first task is to suggest a rule of correspondence between the messages
and a subset of the code words. For this, Shannon chooses the following
independent random encoding scheme.

Consider a wheel, the circumference of which has been partitioned into
arcs numbered $u_1, u_2, \ldots, u_B$. The lengths of the arcs are selected
proportional to elements of an arbitrarily specified probability row
matrix:

$[P\{u_1\}, P\{u_2\}, \ldots, P\{u_B\}]$

The code vocabularies are composed in the following manner: We select,
at random, a word $u_k$ to correspond to the message $m_1$. (Messages are
assumed to have equal probabilities of 1/M.) Next, we select a word,
still at random and independently of prior selections, to correspond to
the message $m_2$, and so on. This code will be referred to as code $K_1$.
Of course, if we do this experiment all over again it is likely to produce a
code book that is not identical with $K_1$. Indeed there are $B^M$ such
distinct codes, some of which may be satisfactory and others rather poor.
For instance, a code book in which all messages $m_1, m_2, \ldots, m_M$ have
been assigned the same word $u_i$ is a highly degenerate and useless code.
Our next task is to devise a decoding scheme for every one of these
distinct codes and to make an estimate of the average probability of
error in decoding. This is not for any particular code $K_1$ but over the
ensemble of such codes.
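
The random selection of code books can be imitated on a small scale. The Python sketch below is an illustration under assumptions that are not in the text: a binary symmetric channel, a wheel with equal arcs (all words equally likely to be drawn), maximum-likelihood decoding applied separately to each code, and hypothetical function names. It draws several codes independently and computes the exact error probability of each, so that the ensemble average can be compared with the best member.

```python
import random
from itertools import product


def word_error(code, p):
    """Exact average error of maximum-likelihood decoding on a BSC, equiprobable messages."""
    n = len(code[0])
    correct = 0.0
    for v in product((0, 1), repeat=n):
        best = 0.0
        for u in code:
            d = sum(a != b for a, b in zip(u, v))      # letters changed by the channel
            best = max(best, p**d * (1 - p)**(n - d))  # P{v | u}
        correct += best / len(code)                    # only one message can be decoded correctly
    return 1.0 - correct


def random_code(m, n, rng):
    """Pick m words of n letters independently and uniformly; repeated words are allowed."""
    return [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(m)]


def ensemble_errors(m, n, p, trials, seed=0):
    """Error probabilities of `trials` independently drawn code books."""
    rng = random.Random(seed)
    return [word_error(random_code(m, n, rng), p) for _ in range(trials)]


ERRORS = ensemble_errors(m=4, n=7, p=0.05, trials=50)
AVERAGE = sum(ERRORS) / len(ERRORS)
```

Whatever codes happen to be drawn, the smallest entry of ERRORS cannot exceed AVERAGE, which is the elementary fact on which the ensemble argument rests.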

Consider a composite communication system having code words
$u_1, u_2, \ldots, u_B$, with respective probabilities $[P\{u_1\}, P\{u_2\}, \ldots, P\{u_B\}]$,
fed into the channel as described by Eq. (12-46). Knowing
the noise characteristics of the channel, we can readily compute the
joint probability matrix of the input and output word pairs $P\{u,v\}$. For
the detection of received messages, in this composite scheme, we assume
the maximum-likelihood detection criterion as before. That is, the
receiving space will be partitioned into a number of regions. The
regions will be assigned to the transmitted words in a one-to-one manner,
based on the greatest conditional probability. A region $A_i$ will correspond
to a transmitted word $u_i$ if, and only if, for any received word
$v_k \in A_i$ we have

$P\{u_i|v_k\} \ge P\{u_j|v_k\}$    i = 1, 2, . . . , M    $j \neq i$    (12-48)

This decoding procedure will be modified to fit each of the $B^M$ codes
($K_1$, $K_2$, . . . , etc.). Suppose that a word $v_0$ has been received. The
receiver refers to the conditional probabilities computed in the composite
code

$P\{u_\alpha|v_0\} \ge P\{u_\beta|v_0\} \ge \cdots \ge P\{u_s|v_0\} \ge P\{u_0|v_0\} \ge \cdots$    (12-49)




written in order of magnitude, and to the particular code K that was
selected at random. He then chooses the message m associated with a
u in K having the highest probability in the above sequence. Now we
shall consider the probability of making an error in decoding in the case
where $u_0$ is sent and $v_0$ received. Suppose that a word corresponding to
a message $m_1$ was transmitted. The receiver will make an error only if
one of the u to the left of $u_0$ in the sequence (by order of conditional
probabilities) has been assigned to a message other than $m_1$, or if more
than one message has been assigned to $u_0$. The probability that a
particular message, say $m_2$, other than $m_1$ was assigned a u to the left of
$u_0$ or to $u_0$ in the sequence is

$P\{u\}$    for all $u \in S_{v_0}(u_0)$    (12-50)

where $S_{v_0}(u_0)$ is the set of u such that $P\{u|v_0\} \ge P\{u_0|v_0\}$.

The total probability of this in the ensemble code, $Q_{v_0}(u_0)$, is the probability
of error:

$Q_{v_0}(u_0) = \sum_{u \in S_{v_0}(u_0)} P\{u\}$    (12-51)

The probability of having no ambiguous situations, that is, no other
particular message $m_2$ being assigned to any words $u \in S_{v_0}(u_0)$, is

$1 - Q_{v_0}(u_0)$

Owing to the argument of independence of successive message encoding,
the probability that no other one of the M - 1 messages will correspond
to words in $S_{v_0}(u_0)$ is

$[1 - Q_{v_0}(u_0)]^{M-1}$    (12-52)

Conversely, the probability of having an ambiguous or erroneous situation
is

$1 - [1 - Q_{v_0}(u_0)]^{M-1}$

This is, in a way, a measure of probability of error for any particular
pair of words $(u_0,v_0)$ in the code ensemble. The next step is to compute
an average probability of error $P_a$ for the code ensemble. Let $P\{u,v\}$ be
the probability of using a particular pair of code words (u,v) in the code
ensemble. The average error for the code ensemble is

$P_a = \overline{1 - [1 - Q_v(u)]^{M-1}} = \sum_U \sum_V P\{u,v\}[1 - (1 - Q_v(u))^{M-1}]$    (12-53)

Note that the probability of error $P_e$ of the best codes of the ensemble
cannot be more than $P_a$. Thus Shannon obtains a measure of the average
error probability of the code ensemble in terms of $Q_v(u)$. In the following
section we show that $P_a$ and $Q_v(u)$ are closely related to the word
length n and the rate of transmission of information in the channel. A
link between $Q_v(u)$ and the channel parameters is provided by a theorem
due to Shannon.

12-9. A Relation between Transinformation and Error Probability.
With the notation of the previous section, we consider the transmission
of M equiprobable code words. As before, let $P\{u\}$ and $P\{u,v\}$ stand
for the input and the joint probabilities of word u and pair of words
(u,v), respectively. The mutual information per symbol between the
words (u,v) is

$I(u;v) = \frac{1}{n} \ln \frac{P\{u,v\}}{P\{u\}P\{v\}}$    (12-54)

It is convenient to think of this quantity as a random variable Z. Each
possible value of the pair $(u_k,v_j)$ is a permissible value of the variable.
Furthermore, each such value has a probability associated with it. The
CDF of this variable will be designated by F(z).

$F(z) = P\{I(u;v) \le z\}$    (12-55)


The following theorem interrelates F(z) with the error probability $P_e$ of
our encoding scheme.

Theorem. For a given $P\{u\}$ on a discrete memoryless channel [hence
specified F(z)] and arbitrary $\theta > 0$, there exists a block code with prespecified
M equiprobable code words such that the error probability $P_e$
is bounded by

$P_e \le F(R + \theta) + e^{-n\theta}$    (12-56)

Proof. We divide up the (u,v) pairs in two complementary sets T
and T'. If for any chosen word pair the mutual information per letter
is larger than the chosen reference value of $R + \theta$, we include that pair in
the set T, otherwise in T'. Analytically,

$(u,v) \in T \Leftrightarrow I(u;v) > R + \theta$
$(u,v) \in T' \Leftrightarrow I(u;v) \le R + \theta$    (12-57)

The total probability of the random variable Z associated with the code-point
pairs in T', by definition of F(z), is thus equal to $F(R + \theta)$.
Now we employ the result of the previous section. That is, we consider
the ensemble code and its average error probability. The average
error for the code ensemble can be rewritten by dividing the summation
operation described in Eq. (12-53) into two parts, first over all the elements
in T' and then in T.

$P_a = \sum_{T'} P\{u,v\}[1 - (1 - Q_v(u))^{M-1}] + \sum_{T} P\{u,v\}[1 - (1 - Q_v(u))^{M-1}]$    (12-58)

But $Q_v(u) \le 1$; therefore

$P_a \le F(R + \theta) + \sum_{T} P\{u,v\}[1 - (1 - Q_v(u))^{M-1}]$

$P_a \le F(R + \theta) + M \sum_{T} P\{u,v\} Q_v(u)$    (12-59)

[See the analogous derivation of Ecj. (0-75).] 

The final ste}) required is an estimation of Qv\i( \ for {'u,v) G T. We 
note that 

p^i\R\ir\ ^ ^ 

or { D ) (12-60) 

In order to relate this inecjuality to observe that 

for It' e ^>„(ii) (12-61) 

P{u'\v\ > (12-62) 

1 > X 

Penally we find 

jit 1 (12-63) 

Pa < F{R + 0) + m2 /' I 't.ejr 
Pa < R{R -f 0) -1- (12-64) 

Pa < F{R + d) + 

Since the average error for the ensemble code is smaller than the quantity 
F(R + θ) + e^{-nθ}, there must exist at least one code with an error proba- 
bility less than or equal to this same quantity. Thus, the theorem is 
proved. 

$$ P_e \le F(R + \theta) + e^{-n\theta} \tag{12-65} $$

As n is increased, for a fixed θ (θ can be selected arbitrarily small), the 
term e^{-nθ} and also R become smaller and smaller. Similarly, for a fixed 
M, one may select n large enough to keep R and F(R + θ) arbitrarily 
small. Note that the probability of error decreases approximately 
exponentially with the increase of word length. The exponential feature 
of this interrelationship is an interesting result which will be further 
elaborated upon in the next section (see also theorems 2 and 3 of Shannon 
[4]). 
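
As a numerical illustration, the bound of Eq. (12-56) can be evaluated for a binary symmetric channel. The sketch below is not part of the original derivation: the function name `shannon_bound`, the choice of equiprobable independent input digits, and the use of natural logarithms are assumptions made for the example. With k digit errors in a word of length n, the mutual information per symbol is log 2 + [(n − k) log (1 − p) + k log p]/n, so that F(z) is a binomial tail.

```python
import math

def shannon_bound(n, p, R, theta):
    """Evaluate F(R + theta) + exp(-n*theta), the right side of (12-56),
    for a binary symmetric channel with crossover probability p.

    Assumptions (not from the text): equiprobable independent input
    digits, natural logarithms, R and theta in nats per symbol.
    """
    z = R + theta
    F = 0.0
    for k in range(n + 1):  # k = number of digits received in error
        # mutual information per symbol for a word pair with k errors
        info = math.log(2) + ((n - k) * math.log(1 - p) + k * math.log(p)) / n
        if info <= z:
            F += math.comb(n, k) * p**k * (1 - p)**(n - k)
    return F + math.exp(-n * theta)

if __name__ == "__main__":
    for n in (25, 100, 400):
        print(n, shannon_bound(n, p=0.1, R=0.1, theta=0.1))
```

For p = 0.1 the capacity is about 0.37 nat per symbol, so R + θ = 0.2 lies below it and the printed bound shrinks as n grows, in line with the remarks above.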





12-10. An Exponential Bound for Error Probability. The immediate 
objective of this section is to integrate the derived bound on error proba- 
bility in a suitable single exponential term: 

$$ P_e \le K_1 e^{-K_2 n} \tag{12-66} $$

where K_1 and K_2 are positive constants. In this form, one can directly 
examine the exponential manner of achieving any desired lower error 
probability with increasing n. This may be done by applying an inter- 
esting inequality due to H. Chernoff.* This inequality gives a bound 
for the probability distribution of the sum of a finite number n of iden- 
tically distributed random variables in terms of n and their common 
moment-generating function. The inequality was first applied by 
Shannon to give a relation between error probability and the charac- 
teristic function of transinformation per symbol. He has also 
successfully applied this inequality in other instances in information- 
theory problems. It is hoped that the insertion of this section may prove 
useful for a more extensive application of Chernoff's inequality in allied 
problems. 

Simplified Chernoff's Inequality. Let S_n be the sum of a finite number 
of independent random variables X_k with identical moment-generating 
functions φ(t), and let 

$$ S = \frac{S_n}{n} = \frac{1}{n} \sum_{k=1}^{n} X_k \qquad k = 1, 2, \ldots, n \tag{12-67} $$

Then ρ(s), the CDF of S, will satisfy the inequality 

$$ \rho(\mu'(t)) \le e^{n[\mu(t) - t\mu'(t)]} \qquad \text{for } t < 0 \tag{12-68} $$

where μ(t) = log φ(t) is the logarithm of the common moment-generating 
function. 

Proof. Chernoff's proof is somewhat complicated, as he actually 
proves stronger results than the one given here. The above simplified 
version of Chernoff's inequality can be proved in the following simple 
steps. 

1. Let X be a random variable with a probability density distribution 
f(x), and F(X) a real-valued function associated with X such that 

$$ F(x) \ge 0 \qquad \text{for } -\infty < x < \infty $$
$$ F(x) \ge A > 0 \qquad \text{for } x \le a $$

Then 

$$ P\{X \le a\} \le \frac{1}{A} E[F(X)] \tag{12-69} $$

The proof follows immediately from the inequalities 

$$ E[F(X)] = \int_{-\infty}^{\infty} F(x) f(x)\, dx \ge \int_{-\infty}^{a} F(x) f(x)\, dx \ge A \int_{-\infty}^{a} f(x)\, dx \tag{12-70} $$

* H. Chernoff, A Measure of Asymptotic Efficiency for Tests of a Hypothesis 
Based on the Sum of Observations, Ann. Math. Statistics, vol. 23, pp. 493-507, 1952. 

2. As an application of the above inequality, consider the case 

$$ F(X) = e^{tX} $$
$$ e^{tx} \ge e^{ta} > 0 \qquad \text{for } x \le a \text{ if } t < 0 $$

Therefore 

$$ P\{X \le a\} \le e^{-ta} E[e^{tX}] \qquad t < 0 $$

or 

$$ P\{X \le a\} \le e^{-ta} \phi_X(t) \qquad t < 0 \tag{12-71} $$

3. Let ρ(s) be the CDF of S; then the application of the above formula 
to X = S_n, whose moment-generating function is [φ(t)]^n because the X_k 
are independent, yields 

$$ \rho(a) = P\{S_n \le an\} \le e^{-tan} [\phi(t)]^n \qquad t < 0 $$
$$ \rho(a) \le e^{n[\mu(t) - ta]} \qquad t < 0 \tag{12-72} $$

The parameter a may be conveniently chosen: 

$$ \mu'(t) - a = 0 $$
$$ \rho(\mu'(t)) \le e^{n[\mu(t) - t\mu'(t)]} \qquad t < 0 \tag{12-73} $$
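
The inequality of Eq. (12-73) can be checked numerically in a simple case. The following sketch is an illustration only; it assumes that the X_k are Bernoulli variables with P{X_k = 1} = p (a choice made for this example, for which φ(t) = 1 − p + p e^t and S_n is binomial), and the function names are hypothetical.

```python
import math

def chernoff_bound(n, p, t):
    """Return (a, bound) with a = mu'(t) and bound = exp(n*[mu(t) - t*mu'(t)]),
    for Bernoulli(p) summands; intended for t < 0."""
    phi = 1 - p + p * math.exp(t)          # common moment-generating function
    mu = math.log(phi)                     # mu(t) = log phi(t)
    mu_prime = p * math.exp(t) / phi       # derivative of mu(t)
    return mu_prime, math.exp(n * (mu - t * mu_prime))

def exact_cdf(n, p, a):
    """Exact P{S_n <= n*a} when S_n is binomial with parameters (n, p)."""
    kmax = math.floor(n * a)
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(kmax + 1))

if __name__ == "__main__":
    n, p = 50, 0.3
    for t in (-0.5, -1.0, -2.0):
        a, bound = chernoff_bound(n, p, t)
        print(t, a, exact_cdf(n, p, a), bound)
```

In each printed row the exact value of ρ(μ'(t)) should not exceed the exponential bound, and both decrease as t becomes more negative.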

As an immediate application of Chernoff's inequality to the mutual 
information distribution, we let 

$$ R + \theta = \mu'(t) $$

and find 

$$ P_e \le e^{n[\mu(t) - t\mu'(t)]} + e^{-n\theta} \qquad t < 0 \tag{12-74} $$

Furthermore, Shannon points out that the parameter θ can be suitably 
chosen to reduce the right-hand side of the inequality to a single 
exponential term: 

$$ \theta = t\mu'(t) - \mu(t) $$
$$ P_e \le 2 e^{n[\mu(t) - t\mu'(t)]} \qquad t < 0 \tag{12-75} $$

This inequality relates the bound for the probability of error with the 
function μ(t), the logarithm of the moment-generating function of trans- 
information per symbol. For specified θ, P_e → 0 as n → ∞. 

Shannon then concludes with a number of geometric results on the 
limiting behavior of the probability of error as n increases. This new 
approach seems to indicate the possibility of special applications in 
several different directions. These directions undoubtedly will be 
investigated in time. At present, Shannon's work is the principal source 
of reference material on this problem, although a thorough understanding 
of his results requires extensive preparation. 





WOLFOWITZ'S PROOF 

The material of the following four sections is based entirely on the 
contribution of J. Wolfowitz. His principal work on the subject is 
contained in Wolfowitz (I, II, III, IV). 

12-11. The Code Book. For a given word length and error bound, 
Wolfowitz derives an upper bound of exponential form for the error 
probability. The alphabet contains a letters which may be conveniently 
taken as integers [a] = [1,2,3, . . . ,a]. All the N encoded words 
(u_1,u_2, . . . ,u_N) are assumed to have equal length n and are referred to 
as n sequences. The received words are also n sequences. The detection 
scheme is based on the distance criterion given in Eq. (12-80). We assume 
that a uniform-error-bound detection scheme exists, that is, 

$$ P\{v \in A_k' \mid u_k\} \le \lambda $$
$$ P\{v \in A_k \mid u_k\} \ge 1 - \lambda \qquad k = 1, 2, \ldots, N \tag{12-76} $$

Our problem is to investigate how small λ can be for a given n and N 
or how small n can be made for a prescribed N and λ. Consider all 
possible n-sequence words u to be transmitted; let N(i|u) be the number 
of times that letter i of the alphabet occurs in a particular word u. 
Let π = [π_1,π_2, . . . ,π_a] be a probability distribution for the a letters of 
the vocabulary. Obviously, we expect on an average a number nπ_i of 
letters i to appear in an arbitrarily selected n sequence. 

Since our final aim is to combat the effect of noise, we shall be rather 
selective in choosing a number of words out of the a^n possible words. 
We divide these words into two complementary categories, those in 
which the number of occurrences of any letter remains within a desired 
threshold value and the remainder. A good threshold level of compari- 
son is provided by a suitable multiple of the standard deviation of the 
binomial distribution about nπ_i (see Sec. 7-6), that is, 

$$ n\pi_i \pm \sqrt{a n \pi_i (1 - \pi_i)} \qquad i = 1, 2, \ldots, a \tag{12-77} $$

On the basis of this very logical approach, Wolfowitz defines a πn sequence 
as a word u satisfying the inequalities 

$$ |N(i|u) - n\pi_i| \le 2\sqrt{a n \pi_i (1 - \pi_i)} \qquad i = 1, 2, \ldots, a \tag{12-78} $$

This is a rather clever choice of the u words for transmission, as will be 
shown shortly. 

Perhaps it is in order to pause a moment and ask ourselves whether it is 
possible to choose πn sequences, and if so, what is the total probability 
of the set of such words. This question is neatly answered by the follow- 
ing lemma: 



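Lemma 

$$ P\{\text{given } n \text{ sequence is a } \pi n \text{ sequence}\} \ge \frac{3}{4} $$

Proof. The proof follows directly from Chebyshev's inequality (see 
Sec. 6-4 and Chap. 7). 

$$ P\{u \text{ is a } \pi n \text{ sequence}\} = P\Big\{ \bigcap_{i=1}^{a} \big[\, |N(i|u) - n\pi_i| \le 2\sqrt{a n \pi_i (1 - \pi_i)} \,\big] \Big\} $$
$$ = 1 - P\Big\{ \bigcup_{i=1}^{a} \big[\, |N(i|u) - n\pi_i| > 2\sqrt{a n \pi_i (1 - \pi_i)} \,\big] \Big\} $$
$$ \ge 1 - \sum_{i=1}^{a} \frac{n\pi_i (1 - \pi_i)}{4 a n \pi_i (1 - \pi_i)} = \frac{3}{4} \tag{12-79} $$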
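
The lemma can be illustrated by a small simulation. The sketch below assumes, as the binomial argument above does, that the n letters of a word are drawn independently according to π; the function names, the alphabet {0, 1, ..., a − 1}, and the particular π are choices made for this example and do not come from Wolfowitz's work.

```python
import math
import random

def is_pi_sequence(u, pi):
    """Check the inequalities (12-78) for a word u over letters 0..a-1."""
    n, a = len(u), len(pi)
    for i, p in enumerate(pi):
        threshold = 2 * math.sqrt(a * n * p * (1 - p))
        if abs(u.count(i) - n * p) > threshold:
            return False
    return True

def fraction_pi_sequences(pi, n, trials, seed=0):
    """Estimate P{random n sequence is a pi-n sequence} by simulation."""
    rng = random.Random(seed)
    letters = list(range(len(pi)))
    hits = 0
    for _ in range(trials):
        u = rng.choices(letters, weights=pi, k=n)
        hits += is_pi_sequence(u, pi)
    return hits / trials

if __name__ == "__main__":
    print(fraction_pi_sequences([0.5, 0.3, 0.2], n=60, trials=2000))
```

The estimated fraction is typically far above 3/4, which reflects how conservative Chebyshev's inequality is; the lemma only guarantees the lower bound.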


Thus we have established that the supply of desired words for trans- 
mission is rather ample. Next, we need to devise a detection criterion 
for the transmission of these words. Let N(i,j|u,v) stand for the number 
of times that in the kth letters of two given words u and v the letters i 
and j appear, respectively (k = 1, 2, . . . , n). Now if u is the trans- 
mitted and v the received word, the expected number of N(i,j|u,v) should 
be 

$$ N(i|u)\, P\{j|i\} $$

where P{j|i} is the corresponding element of the channel probability 
matrix. When we look over a word pair to decide whether they are 
likely to correspond to each other, we again may use a threshold of com- 
parison. Wolfowitz selects those pairs that satisfy the inequalities 

$$ |N(i,j|u,v) - N(i|u) P\{j|i\}| \le \delta\, [N(i|u) P\{j|i\} (1 - P\{j|i\})]^{1/2} \qquad i,j = 1, 2, \ldots, a \tag{12-80} $$

δ is a number larger than 2a. The reason for this selection will become 
more apparent later.* If a pair of words (u,v) satisfies the above require- 
ments, then we say that v is generated (caused) by u. Of course, in the 
real world, v may not be caused by u, but Wolfowitz shows that this 
decision criterion is indeed a very intelligent one. The following lemma 
gives a preliminary estimate of the probability of error in our decision 
scheme. 
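
The criterion of Eq. (12-80) is easy to state as a test on a pair of words. The following is a minimal sketch under assumptions made for illustration: letters are coded 0, 1, ..., a − 1, the channel matrix is a list of rows with P[i][j] = P{j|i}, and the function name `is_generated` is hypothetical.

```python
import math

def is_generated(u, v, P, delta):
    """Return True if v is 'generated' by u in the sense of (12-80).

    u, v  : equal-length sequences of letters 0..a-1
    P     : channel matrix, P[i][j] = P{j|i}
    delta : threshold multiplier (the text takes delta > 2a)
    """
    a = len(P)
    for i in range(a):
        n_i = sum(1 for x in u if x == i)                 # N(i|u)
        for j in range(a):
            n_ij = sum(1 for x, y in zip(u, v) if x == i and y == j)  # N(i,j|u,v)
            mean = n_i * P[i][j]
            spread = delta * math.sqrt(n_i * P[i][j] * (1 - P[i][j]))
            if abs(n_ij - mean) > spread:
                return False
    return True

if __name__ == "__main__":
    bsc = [[0.9, 0.1], [0.1, 0.9]]
    u = [0] * 100
    print(is_generated(u, [1] * 10 + [0] * 90, bsc, delta=5))   # typical noise
    print(is_generated(u, [1] * 40 + [0] * 60, bsc, delta=5))   # far too many errors
```

In the usage example, a received word with 10 errors in 100 digits falls within the threshold of a channel with 10 percent crossover probability, while one with 40 errors does not.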

12-12. A Lemma and Its Application 

Lemma. Let u be any n sequence. Then 

$$ P\{\text{received } n \text{ sequence is generated by } u \mid u \text{ is transmitted}\} \ge 1 - \epsilon' $$

where ε' = a²/δ². 

* Using the familiar distance argument of Chap. 4, it can be seen that δ is a factor 
for selecting those received messages falling within a suitable radius of the transmitted 
message. 





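Proof. The desired probability is 

$$ P\Big\{ \bigcap_{i,j} \big[\, |N(i,j|u,v) - N(i|u) P\{j|i\}| \le \delta\, [N(i|u) P\{j|i\} (1 - P\{j|i\})]^{1/2} \,\big] \Big\} \tag{12-81} $$

$$ \ge 1 - \sum_{i} \sum_{j} P\big\{ |N(i,j|u,v) - N(i|u) P\{j|i\}| > \delta\, [N(i|u) P\{j|i\} (1 - P\{j|i\})]^{1/2} \big\} \tag{12-82} $$

By Chebyshev's inequality each of the a² terms of the double sum is at 
most 1/δ², so that the right-hand side is not less than 1 − a²/δ² = 1 − ε'. 

An appropriate detection scheme can be suggested such that the proba- 
bility of a correct detection is not less than 1 − ε'; this can be made higher 
by increasing δ. 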

The next natural step is to estimate the number of letters i in a word 
v generated by a π sequence. For this, we note that 

$$ N(i|v) = \sum_{j} N(j,i|u,v) \qquad i = 1, 2, \ldots, a \tag{12-83} $$

An upper bound for N(i|v) can be obtained by recalling the upper 
bounds of N(j|u) and N(j,i|u,v) as described in Eqs. (12-78) and (12-80). 

$$ N(j,i|u,v) \le [n\pi_j + 2\sqrt{a n \pi_j (1 - \pi_j)}]\, P\{i|j\} + \delta \{ [n\pi_j + 2\sqrt{a n \pi_j (1 - \pi_j)}]\, P\{i|j\} (1 - P\{i|j\}) \}^{1/2} \qquad i,j = 1, 2, \ldots, a \tag{12-84} $$

On summing and replacing 1 − π_j and 1 − P{i|j} by 1, one finds 

$$ N(i|v) \le n \sum_{j} \pi_j P\{i|j\} + \sum_{j} 2\sqrt{a n \pi_j}\, P\{i|j\} + \delta \sum_{j} [(n\pi_j + 2\sqrt{a n \pi_j})\, P\{i|j\}]^{1/2} \tag{12-85} $$

Let 

$$ \sum_{j} \pi_j P\{i|j\} = \text{probability that element } i \text{ is in the } k\text{th place of } v = \pi_i' \qquad i = 1, \ldots, a $$

Hence, after some algebraic manipulation, 

$$ N(i|v) \le n\pi_i' + 2\sqrt{an} \sum_{j} \sqrt{\pi_j P\{i|j\}} + 2a\delta\sqrt{n} \sum_{j} \sqrt[4]{\pi_j P\{i|j\}} \tag{12-86} $$

(Note that for any number 0 < x < 1, √x > x.) 



This upper bound of N(i|v) can be simplified in several ways. For 
instance, one may wish to apply Hölder's inequality* to obtain 

$$ N(i|v) < n\pi_i' + 3a^2 \sqrt{n}\,(1 + \delta) \sqrt[4]{\pi_i'} \tag{12-87} $$

The corresponding lower bound can be derived in an analogous fashion: 

$$ n\pi_i' - 3a^2 \sqrt{n}\,(1 + \delta) \sqrt[4]{\pi_i'} < N(i|v) < n\pi_i' + 3a^2 \sqrt{n}\,(1 + \delta) \sqrt[4]{\pi_i'} \tag{12-88} $$

Thus, Wolfowitz succeeds in obtaining relatively simple and informative 
upper and lower bounds for N(i|v). For brevity, we refer to these bounds 
as ν_iu and ν_il. Translation of this result in terms of probability is that the 
probability of receiving a word v, owing to the independence of its com- 
posing letters, will be bounded by the inequalities 

$$ \prod_{i} (\pi_i')^{\nu_{iu}} < P\{v\} < \prod_{i} (\pi_i')^{\nu_{il}} \tag{12-89} $$

or 

$$ \exp \sum_{i} (\nu_{iu} \log \pi_i') < P\{v\} < \exp \sum_{i} (\nu_{il} \log \pi_i') $$

Finally, because of Eq. (12-88), we find 

$$ \exp[-nH(\pi') - k_1 (1 + \delta) \sqrt{n}] < P\{v\} < \exp[-nH(\pi') + k_1 (1 + \delta) \sqrt{n}] \tag{12-90} $$

where 

$$ H(\pi') = -\sum_{i} \pi_i' \log \pi_i' $$

and k_1 > 0 is Wolfowitz's constant. 

12-13. Estimation of Bounds. A final phase of Wolfowitz's technique 
requires an estimate of the number of words in the set U of selected words 
and in the total set V of words generated by U. To this end, we must 
translate these probabilistic bounds into bounds on the number of ele- 
ments of a set. The following lemma serves this purpose. 

A Combinatorial Lemma. Let A ⊂ V be some finite set of elements 
u of a probability space satisfying the inequalities 

$$ 0 < m \le P\{u\} \le M \qquad \text{for } u \in A \tag{12-91} $$
$$ 0 < r \le P\{A\} \le R \le 1 $$

Then N(A), the number of elements of the set A, satisfies 

$$ \frac{r}{M} \le N(A) \le \frac{R}{m} $$


* See, for instance, G. H. Hardy, J. E. Littlewood, and G. Pólya, "Inequalities," 
chap. 2, Cambridge University Press, New York, 1952. 





Proof. By definition, 

$$ P\{A\} = \sum_{u \in A} P\{u\} $$

Then it is evident that 

$$ N(A) \cdot m \le P\{A\} \le N(A) \cdot M $$

$$ \frac{r}{M} \le N(A) \le \frac{R}{m} \tag{12-92} $$
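
A small numerical check may help fix the direction of the two inequalities. The sketch below is only an illustration: the probabilities are made-up values, and the function name `count_bounds` is hypothetical.

```python
def count_bounds(probs, r, R):
    """Given the element probabilities of a finite set A and bounds
    r <= P{A} <= R, return (r/M, N(A), R/m) as in Eq. (12-92)."""
    m, M = min(probs), max(probs)
    return r / M, len(probs), R / m

if __name__ == "__main__":
    probs = [0.05, 0.10, 0.08, 0.12]      # illustrative values, P{A} = 0.35
    lower, size, upper = count_bounds(probs, r=0.3, R=0.4)
    print(lower, size, upper)
```

Here the lower bound r/M is 2.5 and the upper bound R/m is 8, and the actual number of elements, 4, lies between them. The bounds are tight only when all elements of A are nearly equiprobable, which is exactly the situation created by Eq. (12-90).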


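Let B_2(v|π) be the number of n sequences generated by all π sequences. 
Then in the above lemma we let 

$$ m = \exp[-nH(\pi') - k_1 (1 + \delta) \sqrt{n}] $$
$$ M = \exp[-nH(\pi') + k_1 (1 + \delta) \sqrt{n}] \tag{12-93} $$
$$ R = 1 \qquad r = \tfrac{9}{16} $$

(By the two preceding lemmas, the total probability of these sequences is 
at least (3/4)(1 − ε'), which exceeds 9/16 since δ > 2a.) 
The following inequality thus holds: 

$$ \tfrac{9}{16} \exp[nH(\pi') - k_1 (1 + \delta) \sqrt{n}] < B_2(v|\pi) < \exp[nH(\pi') + k_1 (1 + \delta) \sqrt{n}] \tag{12-94} $$

or by letting k_2 = k_1 − log (9/16) > 0: 

$$ \exp[nH(\pi') - k_2 (1 + \delta) \sqrt{n}] < B_2(v|\pi) < \exp[nH(\pi') + k_2 (1 + \delta) \sqrt{n}] \tag{12-95} $$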

Our next task is to find similar bounds for B_1(v|u), the number of n 
sequences generated by any π sequence. The computation can be 
achieved in a manner analogous to the foregoing. That is, first we find 
bounds on probabilities of the associated sets and then apply the last 
lemma. For instance, the probability of any n sequence generated by 
any π sequence is 

$$ P\{v|u\} = \prod_{i,j} P\{j|i\}^{N(i,j|u,v)} $$

In a direct manner we find 

$$ P\{v|u\} < \exp\Big[ n \sum_{i,j} \pi_i P\{j|i\} \log P\{j|i\} - \sqrt{n}\,(2a + \delta) \sum_{i,j} \sqrt{\pi_i P\{j|i\}}\, \log P\{j|i\} \Big] $$

$$ P\{v|u\} > \exp\Big[ n \sum_{i,j} \pi_i P\{j|i\} \log P\{j|i\} + \sqrt{n}\,(2a + \delta) \sum_{i,j} \sqrt{\pi_i P\{j|i\}}\, \log P\{j|i\} \Big] \tag{12-96} $$

Therefore, 

$$ \exp\Big[ n \sum_{i} \pi_i H(v|i) - \sqrt{n}\,(2a + \delta) k_3 \Big] < B_1(v|u) < \exp\Big[ n \sum_{i} \pi_i H(v|i) + \sqrt{n}\,(2a + \delta) k_3 \Big] \tag{12-97} $$

where k_3 is an appropriate positive constant and H(v|i) is the channel's 
conditional entropy for a specified i. 

12-14. Completion of Wolfowitz’s Proof. Now we are in a position to 
present Wolfowitz’s proof of the fundamental theorem of information 
theory for a discrete memoryless noisy channel. 

Theorem. There exists a code book containing at least N words of 
length n such that for any arbitrarily specified probability of error λ 
(0 < λ < 1) there is a decoding scheme with a uniform error probability 
λ and 

$$ N > \exp(nC - k\sqrt{n}) \tag{12-98} $$

where k is a constant depending on λ but not on the channel and C is 
the channel capacity. 

Proof. As usual, we partition the receiving space into disjoint regions 
A_1, A_2, . . . , A_N such that, if the sequence u_i is sent, 

$$ \lambda_i = 1 - P\{v \in A_i \mid u_i\} \le \lambda \qquad i = 1, 2, \ldots, N \tag{12-99} $$

From previous discussions we know that such a code exists. The major 
question here is to find out if the code book can be made as large as Eq. 
(12-98) suggests. Attention is called to Eq. (12-82), where δ can be 
conveniently chosen as a√(2/λ), or 

$$ \epsilon' = \frac{a^2}{\delta^2} = \frac{\lambda}{2} $$

Now let u_i (i = 1, 2, . . . , N) be π sequences achieving the maximum 
of transinformation C. The detection regions will be selected as 

$$ v \in A_i \text{ if } v \text{ is an } n \text{ sequence generated by } u_i \qquad i = 1, 2, \ldots, N $$

Furthermore, we assume that the code is a maximal code in the sense 
that an element cannot be added without raising the probability 
of error λ. Let u_0 be a π sequence which achieves the maximum of trans- 
information through the channel. If u_0 is not included in the set u_1, 
u_2, . . . , u_N, then 

$$ P\Big\{ (v \text{ generated by } u_0) \in \bigcup_{k=1}^{N} A_k \Big\} > \frac{\lambda}{2} \tag{12-100} $$

Indeed, if this was not the case, one could add u_0 to the code book and 
associate with it the corresponding generated words of length n which 
are not in the union of the A_k. According to previous lemmas, the number of 
n sequences in the union of the sets A_k (k = 1, 2, . . . , N) is not less than 

$$ \frac{3\lambda}{8} \exp[nH(\pi') - k_1 (1 + \delta) \sqrt{n}] \tag{12-101} $$

and not more than 

$$ N \exp\Big[ n \sum_{i} \pi_i H(v|i) + \sqrt{n}\,(2a + \delta) k_3 \Big] \tag{12-102} $$

where π' and π are, respectively, the output and the input probability 
row vectors achieving maximum transinformation. Thus we find 

$$ N > \frac{3\lambda}{8} \exp\{ nC - \sqrt{n}\,[k_1 (1 + \delta) + (2a + \delta) k_3] \} > \exp(nC - k\sqrt{n}) \tag{12-103} $$

with k = k_1(1 + δ) + (2a + δ)k_3 − log (3λ/8). 

From this proof, one can conclude that, for any specified word length 
n, one can find a code book containing at least exp (nC − k√n) words and the 
probability of error is also of exponential form. For the 
detail and the values of the constants C_1 and C_2 of that form in terms of 
the previous constants, see Wolfowitz (IV). 

A strong converse of the above theorem was first given by Wolfowitz. 
The converse of this theorem states that for a given n and λ it is impossible 
to devise code books with a given arbitrarily small error and containing 
more than exp (nC + k'√n) words. The constant k' depends generally 
on λ but not on the channel. For the sake of reference, the two versions 
of the converse theorem are given below. 

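Weak Converse. The fundamental theorem does not hold for 

$$ \log N > \frac{nC + 1}{1 - \lambda} \tag{12-104} $$

A proof for the weak converse can be directly derived from Eqs. (12-12). 

$$ nC \ge H(U) - H(U|V) \ge \log N - \lambda \log (N - 1) - 1 $$

$$ \lambda \ge 1 - \frac{nC + 1}{\log N} \tag{12-105} $$

Or, if we let log N = nH, 

$$ \lim_{n \to \infty} \lambda \ge 1 - \lim_{n \to \infty} \frac{nC + 1}{nH} = 1 - \frac{C}{H} \tag{12-106} $$

Obviously, for sufficiently large n, λ → 0 is possible if H < C. This is not 
the case for H > C, where λ remains bounded away from zero. 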
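
The bound of Eq. (12-105) is simple enough to tabulate. The sketch below assumes that C and log N are both measured in bits, which is consistent with the constant 1 in the inequality; the function name `weak_converse_bound` is hypothetical.

```python
def weak_converse_bound(n, C, log_N):
    """Lower bound (12-105) on the error probability of any code with
    N words of length n; C and log_N are assumed to be in bits.
    The bound is vacuous (returned as 0) when it would be negative."""
    return max(0.0, 1.0 - (n * C + 1.0) / log_N)

if __name__ == "__main__":
    n, C = 100, 0.5
    for rate in (0.4, 0.6, 0.8, 1.0):     # rate H = (log N)/n in bits per symbol
        print(rate, weak_converse_bound(n, C, rate * n))
```

For rates below C the bound is vacuous, while for rates above C it approaches 1 − C/H as n grows, as in Eq. (12-106).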

Strong Converse. For any specified 0 < λ < 1 we have 

$$ \limsup_{n \to \infty} \frac{1}{n} \log N(n,\lambda) \le C \tag{12-107} $$

For proof of this important theorem and its application to BSC see the 
cited reference and Feinstein (II). 



The fundamental theorem of discrete memoryless channels and its 
converse have been extended to the case of channels with finite memory. 
Channels with memory have been studied under stationary and non- 
stationary regimes. These extensions require a degree of mathematical 
sophistication beyond the scope of this text. The mathematically 
inclined reader is in the fortunate position of having at his disposal a 
comprehensive literature on this subject. During the past few years a 
number of competent mathematicians in the United States, the Union of 
Soviet Socialist Republics, and Europe have made considerable headway 
in solidifying and generalizing Shannon's original ideas. Besides the 
mathematical references mentioned earlier in this book, a number of 
articles are included in the Bibliography at the end of the book. Articles 
by the following authors are among those of special interest: Blackwell, 
Breiman, and Thomasian; Dobrushin;* Feinstein; Gel'fand, Kolmogorov, 
and Iaglom; Gel'fand and Iaglom; A. Rényi; and Rozenblat-Rot. 

From an engineering point of view, it is of prime interest to examine 
the physical context of the recent mathematical developments. In 
light of this, the ultimate performance and the merit of our communica- 
tion systems and techniques should be reevaluated. However, our 
journey has come to an end. It is hoped that the present work will 
stimulate the reader to pursue further investigations leading to possible 
applications of the theory. 

* R. L. Dobrushin, General Formulation of Shannon's Fundamental Theorem in 
Information Theory, Doklady Akad. Nauk SSSR, vol. 126, no. 3, pp. 474-477, 1959; 
translation in Automation Express, January, 1960. 



CHAPTER 13 


GROUP CODES 


13-1. Introduction. The fundamental theorems of information theory 
described in the preceding chapter are, in essence, realizability theorems. 
They prove the existence of a code book for the transmission of informa- 
tion at a rate not higher than the channel capacity with any arbitrarily 
small error probability. They do not show, however, methods for obtain- 
ing such valuable encoding procedures. The synthesis of encoding pro- 
cedures is in itself a subject of great professional interest with a rapidly 
growing technical literature. In fact, a far greater number of papers on 
coding theory are available than on information theory per se. The 
contribution of these papers has created an area of investigation (gen- 
erally referred to as coding theory) which appears at present to be dis- 
tinct from information theory. 

This chapter is not a sequel to the preceding chapter. From a 
pedagogical point of view, this chapter may be included at any time 
subsequent to Chap. 4, whenever a digression from information theory 
would be welcomed. 

The area of information-processing machines, such as digital computers, 
is a most important field of application for binary codes. Fortunately, 
at present, there is a large variety of error-correcting binary codes availa- 
ble. The discovery of these codes has been greatly stimulated by the 
development of information theory, although coding theory seems to be 
somewhat directed toward the mechanical implementation of the codes. 

The abundant coding literature appears to be undergoing a major 
fundamental organization. A shift toward general methods rather than 
individualized code books seems imminent. The work of a large number 
of scientists, such as R. W. Hamming and D. Slepian, to mention just two, 
has contributed to this mathematical integration.* However, relatively 
speaking, the mathematical theory of codes appears to be in an early stage 
of development. 

* For a comprehensive list of contributors see the bibliography of coding prepared 
by A. B. Fontaine and W. W. Peterson, Coding Newsletter 60.2. 

Our objective in this chapter is to present, in brief, some developments 
in the field of the so-called group codes. This will provide the reader an 
opportunity to examine one of the possible present applications of 
information theory. We shall not attempt a complete discussion of the 
practical use of these codes. The reader interested in the theory of 
codes and their implementation will find adequate treatment in the 
literature. 

We begin with a brief presentation of systematic or minimum-distance 
parity binary codes discovered by Hamming and supplemented by a large 
number of interesting papers in the past decade. The study of systematic 
codes may be carried on in several directions, for instance: 

1. The size of the code book 

2. Methods for the selection of code words 

3. The error rate in the transmission 

Item 1 has been extensively studied. Many combinatorial results 
have been derived, and more seem to be forthcoming. The study of the 
number of elements in different systematic codes is, in essence, of a 
combinatorial nature. Item 2 is to indicate how one may select a suitable 
code book. The group code of Slepian appears to be the first important 
work in this direction. Slepian's work makes use of some concepts of 
groups, fields, and rings. The third direction constitutes an evaluation 
of the rate of transmission of information for some systematic codes (for 
instance, the works of P. Elias and D. Slepian). As far as the objectives 
of this chapter are concerned, we wish briefly to acquaint the reader with 
these developments. To this end, an elementary acquaintance with the 
concepts of groups, fields, and rings is highly desirable. Sections 13-2 
and 13-3 will provide such a necessary review background. Section 13-4 
applies the content of the previous sections to systematic codes in gen- 
eral. Section 13-5 describes Hamming's codes in brief. The important 
concept of group codes is described in Sec. 13-6. The immediately fol- 
lowing section is devoted to a detection scheme for group codes. Section 
13-10 presents some of the results obtained on maximum size of Hamming 
codes. The material of the latter section is not directly related to the 
rest of the chapter, although of great interest to those engaged in research 
in coding theory. 

13-2. The Concept of a Group. A set of elements denoted by G is 
called a group if it has a well-defined binary operation, which we denote 
by ∘, and an equivalence relation (equality, =) satisfying the following 
conditions, known as group axioms: 

1. Closure: For every a, b ∈ G 

a ∘ b ∈ G 

2. Associative law of products: For every a, b, c ∈ G 

(a ∘ b) ∘ c = a ∘ (b ∘ c) 

3. Existence of an identity element e ∈ G: For each a ∈ G 

a ∘ e = e ∘ a = a 

4. Existence of an inverse element: For each a ∈ G there exists an ele- 
ment a⁻¹ ∈ G such that 

a ∘ a⁻¹ = a⁻¹ ∘ a = e 

If, besides these axioms, the elements of the group satisfy the commu- 
tative law of products, then the group will be referred to as a commutative 
or Abelian group: 

a ∘ b = b ∘ a 

As a simple example, note that the rational numbers, with the exclusion 
of zero, form a group under the ordinary multiplication and definition 
of the equality of numbers. The number 1 is the identity element of the 
group, and the exclusion of 0 guarantees the existence of an inverse for 
every member of the group. Furthermore, these numbers form an 
Abelian group. 
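
For a finite set the four axioms can be verified by direct enumeration. The sketch below is an illustration and not part of the original text; because the nonzero rationals are infinite, it substitutes a finite analogue, the nonzero residues modulo 5 under multiplication, and the function names are hypothetical.

```python
from itertools import product

def is_group(elements, op):
    """Check the four group axioms by brute force on a finite set."""
    elements = list(elements)
    # 1. Closure
    if any(op(a, b) not in elements for a, b in product(elements, repeat=2)):
        return False
    # 2. Associativity
    if any(op(op(a, b), c) != op(a, op(b, c))
           for a, b, c in product(elements, repeat=3)):
        return False
    # 3. Identity element
    identities = [e for e in elements
                  if all(op(a, e) == a and op(e, a) == a for a in elements)]
    if not identities:
        return False
    e = identities[0]
    # 4. Inverse element for every member
    return all(any(op(a, b) == e and op(b, a) == e for b in elements)
               for a in elements)

def is_abelian(elements, op):
    """Check the commutative law on a finite set."""
    elements = list(elements)
    return all(op(a, b) == op(b, a) for a, b in product(elements, repeat=2))

if __name__ == "__main__":
    mul5 = lambda a, b: (a * b) % 5
    mul6 = lambda a, b: (a * b) % 6
    print(is_group([1, 2, 3, 4], mul5), is_abelian([1, 2, 3, 4], mul5))
    print(is_group([1, 2, 3, 4, 5], mul6))   # closure fails: 2 * 3 = 0 (mod 6)
```

The nonzero residues modulo 5 form an Abelian group, whereas the nonzero residues modulo 6 do not form a group at all, since the product of 2 and 3 falls outside the set.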

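As a second example, one can verify that the set of the following six 
matrices forms a group under ordinary matrix product and equality. 

$$ \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \quad \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \quad \begin{bmatrix} 1 & 0 \\ -1 & -1 \end{bmatrix} \quad \begin{bmatrix} 0 & 1 \\ -1 & -1 \end{bmatrix} \quad \begin{bmatrix} -1 & -1 \\ 1 & 0 \end{bmatrix} \quad \begin{bmatrix} -1 & -1 \\ 0 & 1 \end{bmatrix} $$

In fact, axioms 1 and 2 can be directly verified. The first matrix may be 
selected as the identity element; since all matrices are nonsingular, the 
verification of the fourth axiom is immediate. 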
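
The closure of this set can be confirmed by forming all 36 products. The following sketch is an illustration only; the matrices are entered as tuples of rows, and the helper names are hypothetical.

```python
def matmul2(A, B):
    """Product of two 2 x 2 matrices given as tuples of rows."""
    return tuple(
        tuple(sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

# The six matrices, identity first
MATRICES = [
    ((1, 0), (0, 1)),
    ((0, 1), (1, 0)),
    ((1, 0), (-1, -1)),
    ((0, 1), (-1, -1)),
    ((-1, -1), (1, 0)),
    ((-1, -1), (0, 1)),
]

def is_closed(mats):
    """True if every product of two members is again a member."""
    return all(matmul2(A, B) in mats for A in mats for B in mats)

def has_inverses(mats):
    """True if every member has an inverse inside the set."""
    identity = ((1, 0), (0, 1))
    return all(any(matmul2(A, B) == identity for B in mats) for A in mats)

def is_commutative(mats):
    return all(matmul2(A, B) == matmul2(B, A) for A in mats for B in mats)

if __name__ == "__main__":
    print(is_closed(MATRICES), has_inverses(MATRICES), is_commutative(MATRICES))
```

The set is closed and contains all inverses, but it is not commutative, so this group of order six is not Abelian.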

Two groups, G and G', are said to be simply isomorphic or equivalent 
if there is a one-to-one correspondence between their elements such that 

if a ∈ G ↔ a' ∈ G' 

and b ∈ G ↔ b' ∈ G' 

then a ∘ b ∈ G ↔ a' ∘ b' ∈ G' 

If a nonempty subset of the elements of a group under the same product 
law themselves form a group, they are referred to as a subgroup of the 
original group. 

The associative law of products implies that we can form different 
powers of elements of a group: 

a ∘ a = a² 

a ∘ a ∘ a = a³ 

A group which consists of a single element and its powers is called a 
cyclic group. Evidently, cyclic groups are Abelian. The order of a 
group is the number of elements of the group. A group of finite order is 
referred to as a finite group. 

Equivalence Relation. Given a set S, we say that a relation of equiva- 
lence has been established on elements of S if it is possible to assert 
whether any two arbitrary elements a and b are equivalent or not. Two 
arbitrary elements a and b of S are said to have an equivalence relation 
(a ~ b in symbols) if the relation satisfies the following conditions: 

1. Reflexive: that is, true of a and a 

2. Symmetric: that is, true for b and a if it is true for a and b 

3. Transitive: that is, true for a and c if it is true for a and b and for 
b and c 

An equivalence relation divides the elements of a set into disjoint subsets 
of equivalent elements. 

Now we shall apply the concept of equivalence to groups. Let H be a 
subgroup of a group G. Two elements a and b of G are said to be right 
congruent if, and only if, there exists an element c in H such that 

a = b ∘ c 

Similarly, by left congruence we mean a = c ∘ b. 

A congruence relationship with respect to a subgroup H of a group G 
satisfies all three requirements of equivalence. Thus, one can see that 
the elements of G are divided into equivalent classes. These equivalent 
classes are referred to as cosets. 

For the proof, let 

a ∈ G     b ∈ G 

Then the relation a ~ b, taken to hold if and only if ab⁻¹ ∈ H, is an 
equivalence relation. In fact, 

1. a ∈ G implies a ~ a, as aa⁻¹ = e ∈ H. 

2. ab⁻¹ ∈ H implies ba⁻¹ ∈ H, as ba⁻¹ = (ab⁻¹)⁻¹. 

3. ab⁻¹ ∈ H and bc⁻¹ ∈ H imply ac⁻¹ ∈ H, since (ab⁻¹)(bc⁻¹) = ac⁻¹. 

It can be shown that the order of a subgroup H of a finite group G is a 
divisor of the order of G. In fact, let the order of G and H be g and h, 
respectively. Let the elements of H be 

{H} = {a_1, a_2, . . . , a_h} 

If b_1 is an element of G but not of H, then the set H_1 of order h has all 
distinct elements which are also distinct from the elements of H. 

{H_1} = {b_1 ∘ a_1, b_1 ∘ a_2, . . . , b_1 ∘ a_h} 

If {H} and {H_1} do not exhaust G, select an element b_2 ∈ G but b_2 ∉ H 
∪ H_1 and consider the set H_2: 

{H_2} = {b_2 ∘ a_1, b_2 ∘ a_2, . . . , b_2 ∘ a_h} 

Of course, elements of {H_2} are distinct from elements of {H_1} and {H}. 
Continuing this process, we find that the elements of G will be divided into 
k cosets each having h elements. Thus 

g = hk 
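
The construction of the cosets H, H_1, H_2, . . . can be followed step by step on a small group. The sketch below is an illustration under assumptions chosen for the example: the group is taken to be the integers 0 to 11 under addition modulo 12, the subgroup is {0, 4, 8}, and the function name `cosets` is hypothetical.

```python
def cosets(group, subgroup, op):
    """Split a finite group into the cosets {b o a : a in H}, choosing each
    new representative b outside the cosets already formed."""
    result = []
    covered = set()
    for b in group:
        if b in covered:
            continue                      # b already lies in an earlier coset
        coset = frozenset(op(b, a) for a in subgroup)
        result.append(coset)
        covered |= coset
    return result

if __name__ == "__main__":
    G = list(range(12))
    H = [0, 4, 8]
    add12 = lambda x, y: (x + y) % 12
    parts = cosets(G, H, add12)
    for part in parts:
        print(sorted(part))
    print(len(G) == len(H) * len(parts))  # g = h * k
```

Here g = 12 and h = 3, and the procedure yields k = 4 disjoint cosets of three elements each.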


13-3. Fields and Rings. In the preceding section, we have outlined 
the definition of a group and a subgroup. This section presents the basic 
definition of a field and a ring. While the content of Sec. 13-2 is essential 
for the study of the elements of group codes, the material of this section 
is rather optional. Its inclusion may prove to be a convenient reference 
item for those who wish to consult some of the most recent articles on 
group codes. 

A system of elements F is said to form a field if, besides the equality of 
the elements, two operations (say addition and multiplication) are defined and 
if for any a, b, c, x ∈ F the following laws are valid. (The laws given 
here are not presented in the form of a minimal set of postulates. Slight 
redundancy has been injected for pedagogical reasons.) 

1. Closure: a + b ∈ F                    1. Closure: a · b ∈ F 
2. Associativity:                        2. Associativity: 
   (a + b) + c = a + (b + c)                (a · b) · c = a · (b · c) 
3. Additive identity: There exists       3. Multiplicative identity: There 
   z ∈ F such that                          exists u ∈ F such that 
   x + z = z + x = x                        x · u = u · x = x 
4. Additive inverse: For each x ∈ F      4. Multiplicative inverse: For each 
   there exists x* ∈ F such that            x ≠ z, there exists an inverse 
   x + x* = x* + x = z                      x⁻¹ ∈ F such that 
                                            x · x⁻¹ = x⁻¹ · x = u 
5. Commutativity:                        5. Commutativity: 
   a + b = b + a                            a · b = b · a 

Distributive Laws: 

   a · (b + c) = a · b + a · c 
   (b + c) · a = b · a + c · a 

A simple example of a field can be given by the set of all complex 
numbers a + √−1 b under ordinary addition and multiplication. The 
identity element z in this case is (0 + √−1 · 0) and u = 1 + √−1 · 0. 

As another example, let us investigate whether under ordinary matrix 
operations the set of all matrices of the type 

$$ A = \begin{bmatrix} x & y \\ -y & x \end{bmatrix} \qquad x, y \text{ real numbers} $$

forms a field. To this end, let 

$$ B = \begin{bmatrix} a & b \\ -b & a \end{bmatrix} $$

and note that the closure and associative laws are satisfied: 

$$ A + B = B + A = \begin{bmatrix} x + a & y + b \\ -(y + b) & x + a \end{bmatrix} $$

$$ AB = BA = \begin{bmatrix} xa - yb & xb + ya \\ -(xb + ya) & xa - yb \end{bmatrix} $$

As to the zero and unit element, one may write 

$$ z = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix} \qquad u = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} $$

The multiplicative inverse law has not yet been examined. For the 
moment, let B be the inverse element, AB = u; then, since the multiplica- 
tion and the unit matrix are already defined, we find 

$$ xa - yb = 1 \qquad xb + ya = 0 $$

$$ a = \frac{x}{x^2 + y^2} \qquad b = \frac{-y}{x^2 + y^2} $$

Thus if B is selected as the inverse matrix in the familiar sense, the inverse 
matrix is also of the desired form. Since this inverse exists for every A 
except A = z, the above set of matrices forms a field. 
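
The product rule and the inverse of this example can be checked with exact arithmetic. In the sketch below, which is an illustration only, each matrix with rows (x, y) and (−y, x) is represented by the pair (x, y); this representation and the function names are assumptions made for the example.

```python
from fractions import Fraction

def mul(A, B):
    """Product of two matrices of the given type, each stored as a pair
    (x, y) standing for the matrix with rows (x, y) and (-y, x)."""
    x, y = A
    a, b = B
    return (x * a - y * b, x * b + y * a)

def inverse(A):
    """Inverse of the matrix represented by (x, y), valid for A != z."""
    x, y = A
    d = x * x + y * y
    if d == 0:
        raise ZeroDivisionError("the zero element has no inverse")
    return (Fraction(x) / d, Fraction(-y) / d)

if __name__ == "__main__":
    A = (Fraction(3), Fraction(4))
    B = (Fraction(1, 2), Fraction(-2))
    print(mul(A, inverse(A)))        # the unit element u, stored as (1, 0)
    print(mul(A, B) == mul(B, A))    # commutativity of the product
```

The product again has the form (x, y), the inverse of (3, 4) is (3/25, −4/25), and the multiplication commutes, in agreement with the laws listed for a field.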

The idea of a mathematical system called a ring simply follows from 
the concept of a field. All that is required is to relax the laws of multi- 
plicative identity, inverse, and commutativity. Removal of laws 3 to 
5 in the right-hand column in the list of field postulates leads to a ring. 
Obviously, a ring is a more general relationship than that of a field. For 
instance, the set of all n X n matrices with real elements forms a ring. 

13-4. Algebra for Binary n-digit Words. In the light of the above 
development, it becomes apparent that elements of the set S_n consisting of 
all n-digit sequences form a group under ⊕. In fact, for any n-sequence 
word pairs a and b that are members of S_n we have 

a ∘ b = a ⊕ b ∈ S_n 
(a ∘ b) ∘ c = a ⊕ (b ⊕ c) ∈ S_n 
e ∘ a = e ⊕ a = a     where e = (000 · · · 0)          (13-1) 
a ∘ a⁻¹ = a ⊕ a⁻¹ = e 
a⁻¹ = a ∈ S_n 

Note that the above n sequences form an Abelian group of order 2^n. 


In order to verify whether the ring algebra holds, take the above $\oplus$ definition for the addition operation and define the product of two words $a \in S_n$ and $b \in S_n$ as a word $a \cdot b$ whose digits are the product of the matching digits of a and b. With this rule in mind, we note that

$$\begin{aligned}
a \cdot b &= b \cdot a \in S_n \\
(a \cdot b) \cdot c &= a \cdot (b \cdot c) \\
a \cdot (b \oplus c) &= a \cdot b \oplus a \cdot c
\end{aligned} \tag{13-2}$$

The inversion law is not satisfied. The algebra is a commutative ring algebra.
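The ring laws of Eq. (13-2) can be checked by brute force for a small word length. The following is a minimal sketch, not part of the original text: the helper names `xor` and `dot` are hypothetical, words are represented as strings of 0's and 1's, and the all-ones word is assumed to play the role of the multiplicative unit when testing the inversion law.

```python
from itertools import product

def xor(a, b):
    """Digit-by-digit modulo 2 sum of two equal-length binary words."""
    return "".join(str(int(x) ^ int(y)) for x, y in zip(a, b))

def dot(a, b):
    """Digit-by-digit product of two equal-length binary words."""
    return "".join(str(int(x) & int(y)) for x, y in zip(a, b))

n = 3
words = ["".join(bits) for bits in product("01", repeat=n)]

# The three laws of Eq. (13-2), tested over every choice of words.
commutative = all(dot(a, b) == dot(b, a) for a in words for b in words)
associative = all(dot(dot(a, b), c) == dot(a, dot(b, c))
                  for a in words for b in words for c in words)
distributive = all(dot(a, xor(b, c)) == xor(dot(a, b), dot(a, c))
                   for a in words for b in words for c in words)

# Assumption: the unit for the product is the all-ones word. A word has a
# multiplicative inverse only if some x gives a . x = 11...1.
unit = "1" * n
has_inverse = {a: any(dot(a, x) == unit for x in words) for a in words}
```

Under these assumptions only the all-ones word has an inverse, which is one way of seeing why the algebra is a ring and not a field.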

If the word $\phi$ containing n zero digits is taken as the point of origin of an n-dimensional euclidean space, $\phi = (0,0, \ldots ,0)$, the weight of an element $a \in S_n$ is defined as the number of 1's in a:

$$D(a,\phi) = D(a \oplus \phi) = \|a\|^*$$

Then the following relation is self-evident:

$$\|(a \oplus c) \oplus (b \oplus c)\| = \|(a \oplus b) \oplus (c \oplus c)\| = \|a \oplus b\| \tag{13-4}$$

D(a,b) is also referred to as the Hamming distance between points a and b. Let D(u,v) be the Hamming distance between u and v; then

$$D(u,v) = D(v,u) = \|u \oplus v\|$$

$$D(u,u) = \|u \oplus u\| = \|\phi\| = 0$$

The following relations are easy to prove. The reader may wish to offer a proof as an exercise.

$$\begin{aligned}
\|a \cdot b\| &\le \min\,[\|a\|, \|b\|] \\
\|a \oplus b\| &= \|a\| + \|b\| - 2\|a \cdot b\|
\end{aligned} \tag{13-6}$$
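The weight relations above lend themselves to an exhaustive numerical check. The sketch below is an illustration only; the function names are hypothetical and words are again taken as strings of 0's and 1's. It tests both relations of Eq. (13-6) and the translation property of Eq. (13-4) over all five-digit words.

```python
from itertools import product

def xor(a, b):
    return "".join(str(int(x) ^ int(y)) for x, y in zip(a, b))

def dot(a, b):
    return "".join(str(int(x) & int(y)) for x, y in zip(a, b))

def weight(a):
    """Number of 1's in the word a."""
    return a.count("1")

def distance(a, b):
    """Hamming distance, computed as the weight of the modulo 2 sum."""
    return weight(xor(a, b))

words = ["".join(bits) for bits in product("01", repeat=5)]

# ||a . b|| <= min(||a||, ||b||)
bound_holds = all(weight(dot(a, b)) <= min(weight(a), weight(b))
                  for a in words for b in words)
# ||a + b|| = ||a|| + ||b|| - 2 ||a . b||
sum_rule_holds = all(
    weight(xor(a, b)) == weight(a) + weight(b) - 2 * weight(dot(a, b))
    for a in words for b in words)
# Adding the same word c to both points leaves their distance unchanged.
shift_invariant = all(distance(xor(a, c), xor(b, c)) == distance(a, b)
                      for a in words for b in words for c in words)
```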

A set of n-digit words with k parity checks may form a subgroup of the Abelian group $S_n$, as will be discussed shortly. For example, in the case of n = 5 and k = 3, we have $2^3$ words in the code book which form a subgroup of all possible 32 five-digit words.

    e   = 00000        u_4 = 11001
    u_1 = 00111        u_5 = 00100
    u_2 = 11110        u_6 = 11101
    u_3 = 00011        u_7 = 11010

The above eight words could be considered as points in the five-dimensional euclidean space. Thus, according to Sec. 13-2, the group can be expressed in terms of its cosets as

    e       u_1     u_2     u_3     u_4     u_5     u_6     u_7
    00000   00111   11110   00011   11001   00100   11101   11010
    10000   10111   01110   10011   01001   10100   01101   01010
    01000   01111   10110   01011   10001   01100   10101   10010
    00010   00101   11100   00001   11011   00110   11111   11000

* For brevity we make use of $\|a\|$ in lieu of "weight of a." This notation should not be confused with the ordinary euclidean norm.


Example 13-1. Verify the different operations defined in Sec. 13-4 for the binary numbers

    a = 11101        b = 01000        c = 01001

Solution. It is easy to show that a, b, and c are elements of the binary group $S_5$ and that the rules of ring algebra hold.

    a ⊕ b = 10101        a ⊕ c = 10100        b ⊕ c = 00001
    (a ⊕ b) ⊕ c = 10101 ⊕ 01001 = 11100
    a ⊕ (b ⊕ c) = 11101 ⊕ 00001 = 11100
    e ⊕ a = 00000 ⊕ 11101 = 11101
    a ⊕ a^{-1} = 11101 ⊕ 11101 = 00000
    a · b = 01000        b · c = 01000        a · c = 01001
    (a · b) · c = (01000) · (01001) = 01000
    a · (b · c) = (11101) · (01000) = 01000
    a · (b ⊕ c) = (11101) · (00001) = 00001
    a · b ⊕ a · c = 01000 ⊕ 01001 = 00001

To verify the observations about the weight of these elements, one writes

    ||a|| = D(a,φ) = 4        ||b|| = 1        ||c|| = 2
    ||a ⊕ b|| = 3        ||a ⊕ c|| = 2        ||b ⊕ c|| = 1
    ||(a ⊕ c) ⊕ (b ⊕ c)|| = ||10100 ⊕ 00001|| = ||10101|| = 3

Also note that, in accordance with Eq. (13-6), we have

    ||a · b|| = 1        1 = min [4,1]        3 = 4 + 1 − 2 × 1

13-5. Hamming's Codes. Before proceeding to a discussion of group codes, it now may be worthwhile to pause and apply our newly acquired concepts to Hamming's codes. This slight digression should shed more light on the material discussed in the latter part of Chap. 4.

We consider n-digit binary words as points in the n-dimensional euclidean space. By a set A(n,d) we imply the optimal set of such points with a minimum mutual distance d, that is, a set containing A elements such that no additional n-digit word can be incorporated in the set without destroying the property of minimal mutual distance d among its elements. For example, the elements below constitute a set $A_1(6,3)$.

    000000    101010
    010101    111111

Of course, generally we may have several optimal sets $A_1(n,d)$, $A_2(n,d)$, etc. For instance, the set $A_2(6,3)$ below is also an optimal set:

    000000    111000
    010101    001011
    100110    111111

Among all optimal sets A(n,d), the one that contains the largest number of elements will be denoted by B(n,d). For example, it can be shown without difficulty that there are eight elements in B(6,3):

    000000    010101    111000    101101
    100110    110011    011110    001011
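The defining property of an optimal set, namely that no further n-digit word can be admitted without reducing the minimum mutual distance, can be tested by exhaustive search when n is small. The following is a sketch under that brute-force assumption; `min_distance` and `is_maximal` are hypothetical helper names.

```python
from itertools import combinations, product

def distance(a, b):
    """Hamming distance between two equal-length binary words."""
    return sum(x != y for x, y in zip(a, b))

def min_distance(code):
    """Smallest mutual distance among the words of the set."""
    return min(distance(a, b) for a, b in combinations(code, 2))

def is_maximal(code, d):
    """True if no n-digit word outside the set is at distance >= d
    from every word already in the set."""
    n = len(code[0])
    for bits in product("01", repeat=n):
        w = "".join(bits)
        if w not in code and all(distance(w, c) >= d for c in code):
            return False
    return True

A1 = ["000000", "101010", "010101", "111111"]
A2 = ["000000", "111000", "010101", "001011", "100110", "111111"]
B = ["000000", "010101", "111000", "101101",
     "100110", "110011", "011110", "001011"]
```

The search visits all $2^n$ words, so it is practical only for short word lengths such as the n = 6 sets listed above.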

In the light of the developments of Secs. 13-1 to 13-3, we now reexamine these optimal sets of elements and the technique associated with the Hamming codes.

A set S of binary n sequences is said to be a Hamming d-minimum-distance set if for every two distinct words u and v in S we have

$$\|u \oplus v\| \ge d \qquad d \text{ is a positive integer}$$

Let $S_n$ denote the set of all $2^n$ binary words of length n. Each word has a point representation in space and may be considered as a vertex of a unit cube in euclidean n space. A set of words $S \subset S_n$ may be called a $\delta$-error-correcting or $\delta$-error-detecting code if for every pair of distinct elements $u_i$ and $u_j$ in S we have, respectively,

$$\begin{aligned}
D(u_i,u_j) &\ge d = 2\delta + 1 \\
D(u_i,u_j) &\ge d = 2\delta
\end{aligned} \tag{13-7}$$

The set of all n sequences $S_n$ forms an Abelian group under the modulo 2 addition. If the set S is a subgroup of the group $S_n$, then S is called a group code. As will be shown later, a d-minimum-distance code need not be a group code in general.

Hamming's parity distance code (called systematic code) can be effectively used for transmitting information in the presence of independent noise. The general concept of the transmission method is to consider the correction of $\delta$ or less independent errors for a specified word length. Then, by letting $d = 2\delta + 1$, choosing the maximal set B(n,d), and accepting the minimum-distance technique in decoding, one is able to correct up to $\delta$ single errors. A point a in the n space which is closer than $\delta$ to a transmitted signal $u_k$ will be decoded as $u_k$. A discussion of Hamming's systematic codes and a tabulation of B(n,d) for n < 17 and $\delta$ < 6 are given by A. E. Laemmel* (see also Secs. 4-13 and 4-14).

* A. E. Laemmel, Efficiency of Noise-reducing Codes, "London Symposium on Communication Theory 1952," pp. 111-118, Academic Press, Inc., New York, 1953. See also N. Wax, IRE Trans. on Inform. Theory, vol. IT-5, pp. 168-174, 1959.



In the above treatment we have discussed the class of error-correcting codes. The same general idea is applicable to the class of error-detecting codes. That is, given a $\delta$-error-detecting scheme, the minimum distance among the words of this code is not less than $d = 2\delta$. The elimination of a character from every code word can at most reduce their distances by one unit. Thus we obtain a code $(n - 1,\ d = 2\delta - 1)$. In fact, the relation between the two classes is

$$A(n,d) = A(n - 1,\ d - 1)$$


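Some of Hamming's results are as follows:
Hamming's theorem:

$$\begin{aligned}
\text{(I)} \quad & B(n,1) = 2^n \\
\text{(II)} \quad & B(n,2) = 2^{n-1} \\
\text{(III)} \quad & B(n,2k) = B(n - 1,\ 2k - 1) \\
\text{(IV)} \quad & B(n,2k + 1) \le \frac{2^n}{1 + \binom{n}{1} + \cdots + \binom{n}{k}}
\end{aligned} \tag{13-8}$$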


The proof of the validity of these relations follows directly, for instance, from error-correcting considerations as discussed in Chap. 4.

A proof of (IV) may be suggested by contradiction. Let S be the set of all the representative points of the code book and $S_x \subset S_n$ and $S_y \subset S_n$ two subsets containing all points in $S_n$ not farther than $\delta$ from two distinct code points x and y, respectively. Evidently $S_x \cap S_y = \phi$; otherwise their intersection would contain at least one point $z \in S_n$ such that

$$\|z - x\| \le \delta \qquad \|z - y\| \le \delta$$

Such a point z cannot belong to $S_n$ since by hypothesis $\|x - y\| > 2\delta$. Note that each sphere contains $1 + \binom{n}{1} + \cdots + \binom{n}{\delta}$ points; hence the number of points in S is precisely the number of such spheres. Since the total number of points in $S_n$ is $2^n$, then the number of spheres cannot exceed the right-hand side of (IV).
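The sphere-counting argument translates directly into a short computation. The sketch below evaluates the right-hand side of (IV), rounded down to an integer since the number of code words is a whole number; the function names are hypothetical.

```python
from math import comb

def sphere_size(n, radius):
    """Number of n-digit words within the given distance of a fixed word:
    1 + C(n,1) + ... + C(n,radius)."""
    return sum(comb(n, i) for i in range(radius + 1))

def hamming_bound(n, k):
    """Largest integer not exceeding 2^n / (1 + C(n,1) + ... + C(n,k)),
    an upper bound on B(n, 2k + 1)."""
    return 2 ** n // sphere_size(n, k)
```

For example, `hamming_bound(7, 1)` gives 16, and `hamming_bound(6, 1)` gives 9, so that B(6,3) cannot exceed 9.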

Hamming's code can best be defined in terms of its parity-check matrix $[A] = [a_{ij}]$. This is a matrix with element values restricted to 0 and 1, having n − k rows and k columns. Each word to be transmitted over BSC consists of n − k parity checks and k information digits. Without loss of generality, we may assign the last n − k digits to parity checks.

$$u = \alpha_1, \alpha_2, \ldots , \alpha_k;\ \beta_1, \beta_2, \ldots , \beta_{n-k}$$

For a given parity-check matrix and a k sequence, a check digit of u is obtained through the following equation:

$$\beta_i = \sum_{j=1}^{k} a_{ij}\alpha_j \qquad i = 1, 2, \ldots , n - k \pmod 2 \tag{13-9}$$

Symbolically we may write

$$[\beta] = [A][\alpha] \tag{13-10}$$

where the matrices $[\beta]$, $[A]$, and $[\alpha]$ are (n − k) × 1, (n − k) × k, and k × 1, respectively. When $[\alpha]$ runs over all possible k sequences, Eq. (13-10) provides the parity-check parts of the words to be transmitted. As an example, consider the case of a single-error correcting block code

$$n = 7 \qquad k = 4 \qquad n - k = 3$$

The matrix A may be selected to be

$$[A] = \begin{bmatrix} 1 & 0 & 1 & 1 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 0 & 1 \end{bmatrix}$$

For instance, if the information digits of a word are specified as $\alpha_1\alpha_2\alpha_3\alpha_4$, then the parity checks will be given by

$$\begin{aligned}
\beta_1 &= \alpha_1 + \alpha_3 + \alpha_4 \\
\beta_2 &= \alpha_1 + \alpha_2 + \alpha_3 \\
\beta_3 &= \alpha_1 + \alpha_2 + \alpha_4
\end{aligned}$$
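As an illustration of Eq. (13-10), the sketch below encodes every four-digit information sequence with the 3 × 4 matrix given above and then measures the minimum mutual distance of the resulting sixteen words. The names `encode` and `codebook` are hypothetical, and words are represented as lists of integers.

```python
from itertools import product

# Parity-check matrix for n = 7, k = 4, as given in the text.
A = [[1, 0, 1, 1],
     [1, 1, 1, 0],
     [1, 1, 0, 1]]

def encode(alpha, A):
    """Append the check digits [beta] = [A][alpha] (mod 2) to alpha."""
    beta = [sum(a * x for a, x in zip(row, alpha)) % 2 for row in A]
    return list(alpha) + beta

def distance(u, v):
    return sum(x != y for x, y in zip(u, v))

codebook = [encode(alpha, A) for alpha in product([0, 1], repeat=4)]
min_d = min(distance(u, v)
            for i, u in enumerate(codebook) for v in codebook[i + 1:])
```

With this matrix the minimum distance comes out as 3, in agreement with the single-error correcting property claimed for the code.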

The problem of devising systematic codes with a specified mutual distance is thus equivalent to that of finding a suitable parity-check matrix A such that it generates a set of words with the required property. This problem has been discussed at length in the coding literature (see, for instance, M. Golay, G. E. Sacks, R. Chien, E. J. McCluskey, Jr.,* and A. G. Konheim).

As a second example, consider the case of n = 5, k = 2, d = 3. The following 3 × 2 matrix is a suitable parity-check matrix.


$$[A] = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}$$

* McCluskey states that a parity-check matrix is suitable for a minimum-distance code d if, and only if,

1. The weight of each column is greater than or equal to d − 1.

2. The weight of the sum of j columns is greater than or equal to d − j.

The test procedure for this is by no means brief. McCluskey notes that the synthesis of the parity-check matrix is equivalent to a linear programming problem.



The check and the information digits are related by equations

$$\begin{aligned}
\beta_1 &= \alpha_1 + \alpha_2 \\
\beta_2 &= \alpha_1 \\
\beta_3 &= \alpha_2
\end{aligned}$$

When all possible four binary words are considered, the following four words with a minimum mutual distance of 3 are found.

           α_1  α_2  β_1  β_2  β_3
    u_1     0    0    0    0    0
    u_2     0    1    1    0    1
    u_3     1    0    1    1    0
    u_4     1    1    0    1    1          (13-11)


While any A(n,d) may be selected as a distance code, it may not necessarily form a group code. The selection of the parity matrix for a group code and the associated detection scheme will be further described in the next section.

13-6. Group Codes. While studying Hamming's parity-check codes, Slepian considered the interesting problem of finding a subset $S \subset S_n$ such that the elements of S form a subgroup of $S_n$. Such codes are referred to as systematic group codes. Errors in these codes can be rectified by the parity-check method. The existence of such a subgroup and its determination will become apparent shortly.

Let [A] be an appropriate matrix for a parity-check code with the general code word $u = (\alpha_1, \alpha_2, \ldots , \alpha_k;\ \beta_1, \beta_2, \ldots , \beta_{n-k})$. We show that if $[\alpha]$ runs over all possible $2^k$ information sequences the code thus obtained is a subgroup of $S_n$. Indeed, if $u^1$ and $u^2$ are two code words of S, their sum modulo 2 is also a code word of S. Symbolically we may write

$$u^1 = \begin{bmatrix} \alpha^1 \\ \beta^1 \end{bmatrix} \qquad u^2 = \begin{bmatrix} \alpha^2 \\ \beta^2 \end{bmatrix}$$

where $\alpha^1$ and $\alpha^2$ are two column matrices describing any two information sequences and $\beta^1$ and $\beta^2$ are their corresponding parity columns; hence

$$u^1 \oplus u^2 = \begin{bmatrix} \alpha^1 \oplus \alpha^2 \\ \beta^1 \oplus \beta^2 \end{bmatrix} = \begin{bmatrix} \alpha^1 \oplus \alpha^2 \\ A(\alpha^1 \oplus \alpha^2) \end{bmatrix} \tag{13-13}$$

Furthermore, when the k sequence (00 · · · 0) is taken as the information part of a code word, the check digits of that word will also be a zero (n − k) sequence. If the zero n sequence is taken as the identity element, then every code word in S is its own inverse. Elements of S thus fulfill all the requirements of a group.
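The group property can be confirmed numerically for a small parity-check code. The sketch below is an illustration with hypothetical helper names; it assumes the check rules $\beta_1 = \alpha_1 + \alpha_2$, $\beta_2 = \alpha_1$, $\beta_3 = \alpha_2$ of the n = 5, k = 2 code tabulated in (13-11), and tests closure, the identity element, and self-inversion.

```python
from itertools import product

# Assumed 3 x 2 parity-check matrix of the (5,2) code of (13-11).
A = [[1, 1],
     [1, 0],
     [0, 1]]

def encode(alpha):
    """Code word (alpha; beta) with [beta] = [A][alpha] (mod 2)."""
    beta = tuple(sum(a * x for a, x in zip(row, alpha)) % 2 for row in A)
    return tuple(alpha) + beta

def add(u, v):
    """Modulo 2 sum of two words."""
    return tuple((x + y) % 2 for x, y in zip(u, v))

S = {encode(alpha) for alpha in product([0, 1], repeat=2)}
zero = (0,) * 5

closed = all(add(u, v) in S for u in S for v in S)
has_identity = zero in S
self_inverse = all(add(u, u) == zero for u in S)
```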

Next we show that, if $S \subset S_n$ is an optimal code with the minimum distance d, and a any binary n sequence not in S, then $S_1 = a \oplus S$ also constitutes an optimal code with minimum distance d. The validity of this statement can be examined by letting

$$\begin{aligned}
\{S\} &= \{b_1, b_2, \ldots , b_m\} \\
\{S_1\} &= \{a \oplus b_1,\ a \oplus b_2,\ \ldots ,\ a \oplus b_m\}
\end{aligned} \tag{13-14}$$

and reasoning along the following lines. If $S_1$ is not optimal, one may find at least one element ($x \in S_n$, $x \notin S_1$) such that its distance from every element of $S_1$ is more than d. That is,

$$D(x \oplus a \oplus b_k) > d \qquad k = 1, 2, \ldots , m \tag{13-15}$$

This inequality implies that

$$D(x \oplus b_k) > d \qquad k = 1, 2, \ldots , m$$

a fact that is in contradiction to the optimal character of S.

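The $2^n$ elements of $S_n$ can be expressed in terms of the elements of S and its cosets as shown below:

$$\begin{array}{lllll}
e = b_0 = u_1 & u_2 & u_3 & \cdots & u_N \\
b_1 & b_1 \oplus u_2 & b_1 \oplus u_3 & \cdots & b_1 \oplus u_N \\
b_2 & b_2 \oplus u_2 & b_2 \oplus u_3 & \cdots & b_2 \oplus u_N \\
\cdots & & & & \\
b_m & b_m \oplus u_2 & b_m \oplus u_3 & \cdots & b_m \oplus u_N
\end{array} \tag{13-16}$$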


where AT = 2'* — 1. 

In this development every element of $S_n$ appears once and only once. Each row is a coset, and the elements directly under e are the coset leaders. Coset leaders may be chosen one by one, subject only to the constraints of Eq. (13-17).

$$\begin{aligned}
& b_1 \notin S \\
& b_2 \notin (S \cup S_1) \qquad \text{where } S_1 = b_1 \oplus S \\
& b_3 \notin (S \cup S_1 \cup S_2) \qquad \text{where } S_2 = b_2 \oplus S
\end{aligned} \tag{13-17}$$

Observe that if a coset leader is replaced by any element in the coset the same coset will result. For example, the following two cosets are identical.

$$\begin{array}{llll}
b_2 & b_2 \oplus u_2 & \cdots & b_2 \oplus u_N \\
b_2 \oplus u_3 & b_2 \oplus u_3 \oplus u_2 & \cdots & b_2 \oplus u_3 \oplus u_N
\end{array} \tag{13-18}$$

Thus, without loss of generality, we may conveniently arrange the elements of each coset so that the weight of none of its elements is less than that of the coset leader.

$$\|b_i\| \le \|b_i + u_j\| \qquad j = 2, \ldots , N \tag{13-19}$$
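The development of a group into cosets with leaders of least weight can be mechanized as follows. This is a sketch only: `standard_array` is a hypothetical name, the candidate leaders are simply scanned in order of increasing weight, and the eight-word code of Sec. 13-4 is used as the subgroup. Ties between words of equal weight are broken by listing order, so a leader may differ from the one printed in the text while the coset itself is the same.

```python
from itertools import product

def xor(a, b):
    return "".join(str(int(x) ^ int(y)) for x, y in zip(a, b))

def weight(a):
    return a.count("1")

def standard_array(code, n):
    """Rows are cosets of the code; each leader is a word of least
    weight among those not yet placed in the array."""
    rows, used = [], set()
    candidates = sorted(("".join(b) for b in product("01", repeat=n)),
                        key=weight)
    for leader in candidates:
        if leader in used:
            continue
        row = [xor(leader, u) for u in code]
        rows.append(row)
        used.update(row)
    return rows

code = ["00000", "00111", "11110", "00011",
        "11001", "00100", "11101", "11010"]
array = standard_array(code, 5)
leaders = [row[0] for row in array]
```

Because the zero word is scanned first, the top row of the array is the code itself, and every row satisfies Eq. (13-19) by construction.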



When the words of a group code book are arranged in the way described above, we say that the words are in standard order. The expansion given at the end of Sec. 13-4 is that of a standard array. Of course, we have not yet described what decoding technique is going to be employed for transmission over the binary channel. This is described below.

13-7. A Detection Scheme for Group Codes. Assume that the vocabulary of a transmitter consists of $N = 2^k$ binary n sequences:

$$S = [u_1 = e, u_2, \ldots , u_N]$$

Furthermore, let these words form an optimal set of elements with minimum mutual distance d. For convenience of calculation, we assume that all words $u \in S$ are transmitted with equal probability over a BSC. At the receiving end, we may receive any one of $2^n$ distinct words $v \in S_n$. We know that S is a subgroup of $S_n$. Thus, the words of $S_n$ may be developed in a standard array according to elements of S. Our problem now is to devise a suitable detection. At the receiving end, if a word in the ith column of the standard array is received, let the decoder conjecture that the word $u_i$ was transmitted. In the following discussion it will be shown that this conjecture is in agreement with what we have called a maximum-likelihood (minimum-distance) decoder. Indeed, if the word $b_i \oplus u_j$ is received, the detector will search for a $u_k$ such that

$$D(b_i \oplus u_j,\ u_k) \le D(b_i \oplus u_j,\ u_r) \qquad r = 1, 2, \ldots , N \tag{13-20}$$

(Of course the equality may hold for several words; in this case some ambiguity will remain in the method.) But

$$D(b_i \oplus u_j,\ u_k) = D(b_i \oplus u_j \oplus u_k,\ e) = \|b_i \oplus u_j \oplus u_k\| \ge \|b_i\| \tag{13-21}$$

where the last step follows because $u_j \oplus u_k$ is some other element of the subgroup and the array (13-16) is in standard order. Thus if the word $u_j$ is selected to correspond to $b_i \oplus u_j$, then Eq. (13-19) will be satisfied.

$$D(b_i \oplus u_j,\ u_j) = \|b_i\| \le D(b_i \oplus u_j,\ u_k) \tag{13-22}$$

Any element of a standard array is at least as close to the element on the top of its column as to any other transmitted word. The probability of error for the transmission over a BSC can be readily computed. If a word $u_i$ is transmitted and the word $b_k \oplus u_i$ received, the detector does not commit any error. This is, for instance, the case when the noise is additive and has the structure of one of the coset leaders. The probability of such a situation is

$$Q_1 = P\{u = u_i \mid v = u_i \cup (u_i \oplus b_1) \cup (u_i \oplus b_2) \cup \cdots \cup (u_i \oplus b_m)\} = \sum_{k=0}^{m} p^{\|b_k\|} q^{\,n - \|b_k\|} \tag{13-23}$$



Since the coset leaders are of minimal weight, in this detection scheme the probability of correct detection $Q_1$ is made as large as possible (for p < ½). Furthermore, we assume that all words are transmitted with equal probability. Thus the average error probability becomes $1 - Q_1$.

Following the above considerations, as first pointed out by Slepian, one arrives at the interesting result that for a group alphabet the suggested detection scheme is a maximum-likelihood detector, providing the least average probability of error for the transmission of a specified N(n,k) vocabulary over BSC. For a minute examination of this statement along with the computation of a bound on the corresponding average error probability the reader is referred to Feinstein (I).
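The column rule of this detection scheme, and its agreement with minimum-distance decoding, can be illustrated on the eight-word code of Sec. 13-4. The sketch below is illustrative only; the helper names are hypothetical, and the standard array is rebuilt by scanning candidate leaders in order of increasing weight.

```python
from itertools import product

def xor(a, b):
    return "".join(str(int(x) ^ int(y)) for x, y in zip(a, b))

def weight(a):
    return a.count("1")

def distance(a, b):
    return weight(xor(a, b))

def standard_array(code, n):
    """Cosets of the code, each led by a word of least weight."""
    rows, used = [], set()
    for leader in sorted(("".join(b) for b in product("01", repeat=n)),
                         key=weight):
        if leader not in used:
            row = [xor(leader, u) for u in code]
            rows.append(row)
            used.update(row)
    return rows

def decode(v, array, code):
    """Conjecture the code word heading the column in which v appears."""
    for row in array:
        if v in row:
            return code[row.index(v)]
    raise ValueError("word not found in the array")

code = ["00000", "00111", "11110", "00011",
        "11001", "00100", "11101", "11010"]
array = standard_array(code, 5)

# The column rule never does worse than a direct nearest-word search.
agrees_with_min_distance = all(
    distance(v, decode(v, array, code)) == min(distance(v, u) for u in code)
    for row in array for v in row)
```

A table lookup of this kind needs storage for all $2^n$ words, which is the price paid for avoiding a distance computation against every code word.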

13-8. Slepian's Technique for Single-error Correcting Group Codes. We begin with a simple illustration describing Slepian's (n,k) code. Consider Hamming's single-error correcting procedure for n = 5 and k = 2. There are four possible information 2 sequences:

    (00, 01, 10, 11)

A suitable 3 × 2 parity-check matrix is selected as

$$[A] = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}$$

The parity checks are found to be

$$\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{bmatrix}$$

Thus, the four 5 sequences to be transmitted over the BSC are

    u_1 = 0 0 0 0 0
    u_2 = 0 1 0 1 1
    u_3 = 1 0 1 0 1
    u_4 = 1 1 1 1 0

Note that

$$\|u_k\| \ge 3 \qquad k = 2, 3, 4$$

If we have the definite knowledge that not more than a single error has occurred, we can certainly detect and correct that error. When more than a single error may occur, this detector still may be used. For this, we select the $\binom{5}{1} = 5$ words with the lowest possible weight as coset leaders and develop the group according to the elements of S.



    u_1      u_2      u_3      u_4
    00000    01011    10101    11110
    00001    01010    10100    11111
    00010    01001    10111    11100
    00100    01111    10001    11010
    01000    00011    11101    10110
    10000    11011    00101    01110
    10010    11001    00111    01100
    11000    10011    01101    00110


For the seventh coset we select as a leader an element of low weight which has not yet appeared in the array. The same rule, of course, applies to the leader of the last coset. If a word in a column $u_k$ (k = 1, 2, 3, 4) is received, the decoder conjectures that $u_k$ was transmitted. In doing so, the decoder corrects all possible $\binom{5}{1} = 5$ single-error and 2 out of $\binom{5}{2} = 10$ possible double-error patterns. The probability of error associated with this detection scheme for every transmitted word is

$$\begin{aligned}
1 - Q_1 &= 1 - (q^5 + 5pq^4 + 2p^2q^3) \\
&= p^5 + 5p^4(1 - p) + 10p^3(1 - p)^2 + 8p^2(1 - p)^3
\end{aligned}$$
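The two forms of the error probability can be compared numerically. The sketch below sums $p^{w} q^{n-w}$ over the eight coset leaders of the (5,2) array, following Eq. (13-23), and evaluates the expanded polynomial separately; the function names are hypothetical and p is taken as the digit error probability of the BSC.

```python
# Coset leaders of the (5,2) standard array: the zero word, the five
# single-error patterns, and two double-error patterns.
leaders = ["00000", "00001", "00010", "00100",
           "01000", "10000", "10010", "11000"]

def q1(p, leaders):
    """Probability that the noise pattern is one of the coset leaders."""
    q = 1 - p
    n = len(leaders[0])
    return sum(p ** w.count("1") * q ** (n - w.count("1")) for w in leaders)

def error_poly(p):
    """Expanded form of 1 - Q1 for the (5,2) code."""
    q = 1 - p
    return 8 * p**2 * q**3 + 10 * p**3 * q**2 + 5 * p**4 * q + p**5
```

The agreement of the two expressions for any p follows from the binomial expansion of $(p + q)^5 = 1$.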

The average probability of error for this decoding scheme is 1 — Qi. 
The selection of elements of lowest weight in the above situation has 
brought forth the largest possible Qi, that is, the code with the lowest 
error probability among all possible (5,2) codes. Also note that 


3 



i = U 


Now we are in a position to focus attention on Slepian's important statement, that if positive integers n and k satisfy the inequality

3 


k > 2«- 




then a simple parity-check code with a maximum-likelihood detection rule can be described which cannot be excelled by any other code of $2^k$ n-sequence binary words. No other code of the same size has a smaller average probability of error.

A proof of this statement may be given in the light of the following lemma, which was first proved by Hamming in a simpler form.

Hamming's Lemma. The necessary and sufficient condition for an (n,k) group code to be $\delta$-error correcting is that, except for the null sequence, every sequence to be transmitted has a weight not less than $2\delta + 1$. To prove the necessity, note that in a group code the null sequence is a word, and so must be at a distance of $2\delta + 1$ or more from all other words. The sufficiency follows from the fact that in a group code, $u_i \oplus u_j = u_k$ is an element of the coset S (recall that S is a subgroup of $S_n$). Then,

$$\|u_i \oplus u_j\| = \|u_k\| \ge 2\delta + 1$$

That is, all elements of S are at least $2\delta + 1$ units apart.

Denote the $N = 2^k$ words to be transmitted by

$$(u_1, u_2, \ldots , u_N)$$

where $u_1$ is the null n sequence. If the condition of the lemma is satisfied, we select words with weights not exceeding $\delta$ as coset leaders. Thus, if this array is used as the detection scheme, we shall have

$$\|b_i + u_j\| \ge \|u_j\| - \|b_i\| \ge \delta + 1 > \|b_i\|$$

That is, the development of the group is that of a standard array. Conversely, if $b_i$ is a coset leader with $\|b_i\| \le \delta$, then the elements of that coset should satisfy the condition

$$\|b_i + u_j\| \ge \|b_i\| \qquad j = 1, 2, \ldots , N \tag{13-25}$$

Now suppose that one of the u elements, say $u_j$, has a weight of $2\delta$ or less. It is not difficult to show by contradiction that, if the digits of $b_i$ and $u_j$ are conveniently chosen, their sum may lead to a sequence with a higher weight than the coset leader.

For an (n,k) single-error correcting group code, all transmitted words are of weight 3 or more, except the zero sequence. The coset leaders are selected from elements with lowest weight, and thus a standard array is formed. If all words are equiprobable, the source entropy per binary symbol becomes $(1/n) \log 2^k = k/n$. The probability of error for any one of the transmitted messages is

$$P[v \notin V_j \mid u_j] = 1 - Q_j \tag{13-26}$$

The over-all error probability and equivocation are, respectively,

HiY\X) = - ^ Q, log Qi 

3 

The value of $Q_j$ depends on the weight of the coset leaders as described in Eq. (13-23). If coset leaders contain $\gamma_i$ elements of weight i, then

$$Q_1 = \sum_{i=0}^{n} \gamma_i\, p^i q^{\,n-i} \tag{13-28}$$

Of course $\gamma_i$ and the total number of coset leaders must satisfy the conditions

$$\gamma_i \le \binom{n}{i} \qquad \sum_i \gamma_i = 2^{n-k} \tag{13-29}$$

For given values of (n,k) up to n = 12 and k = 10, Slepian evaluates the $\gamma$ coefficients which correspond to the highest $Q_1$ in each case. Table T-4 gives the values of coefficients $\gamma_i$ for different values of n and k. Table T-5 suggests suitable parity-check matrices.

For a more detailed proof of this technique see Slepian (I, particularly Sec. 2.7).

The following example of group encoding and decoding may prove useful in illustrating the described technique.

Example 13-2. Determine a group code (6,3) by Slepian's technique.

Solution. First select a suitable parity-check matrix. According to Table T-5, we may choose

$$\begin{aligned}
\beta_1 &= \alpha_1 + \alpha_2 \\
\beta_2 &= \alpha_1 + \alpha_3 \\
\beta_3 &= \alpha_2 + \alpha_3
\end{aligned} \qquad \text{or} \qquad [A] = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix}$$


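When the information vector covers the eight possible columns of the following message matrix, we obtain the corresponding eight check vectors in a 3 × 8 matrix.

$$\begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 0 & 0 & 0 & 1 & 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 0 & 1 & 1 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 & 0 \end{bmatrix}$$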


Thus, the encoded messages are

    u_1 = 0 0 0 0 0 0
    u_2 = 0 0 1 0 1 1
    u_3 = 0 1 0 1 0 1
    u_4 = 1 0 0 1 1 0
    u_5 = 0 1 1 1 1 0
    u_6 = 1 0 1 1 0 1
    u_7 = 1 1 0 0 1 1
    u_8 = 1 1 1 0 0 0

The rows of this matrix are now used as a coset for developing the group of $2^6$ elements. This can be done by selecting the columns of the following matrix as coset leaders and computing each coset, as before.

    Coset leaders
    1  2  3  4  5  6  7

    0  0  0  0  0  1  1
    0  0  0  0  1  0  0
    0  0  0  1  0  0  0
    0  0  1  0  0  0  0
    0  1  0  0  0  0  0
    1  0  0  0  0  0  1

The elements of cosets are obtained by modulo 2 addition, as shown below.

    000000   001011   010101   100110   011110   101101   110011   111000
    000001   001010   010100   100111   011111   101100   110010   111001
    000010   001001   010111   100100   011100   101111   110001   111010
    000100   001111   010001   100010   011010   101001   110111   111100
    001000   000011   011101   101110   010110   100101   111011   110000
    010000   011011   000101   110110   001110   111101   100011   101000
    100000   101011   110101   000110   111110   001101   010011   011000
    100001   101010   110100   000111   111111   001100   010010   011001

The detection scheme for this standard array is rather simple. If a message 100010 is received, it will be decoded as the message 100110. The corresponding noise vector is 000100.

The probability of a correct detection over a BSC with P|l|l | = p is equal to the 
probability of the noise being of the form of a coset leader : 


$$Q_1 = \sum_{i=0}^{n} \gamma_i\, p^i q^{\,n-i}$$

where $\gamma_i \le \binom{n}{i}$ is the number of coset leaders of weight i and is given in Table T-4:

$$Q_1 = q^6 + 6pq^5 + p^2q^4$$

This decoding procedure corrects all single errors and one double error out of a possible $\binom{6}{2} = 15$ double errors.

13-9. Further Notes on Group Codes. In this section, as a supplement to the material of the previous section, we discuss rules for selecting suitable [A] matrices. The topic has been discussed frequently in numerous papers on coding. The first discussion appears as early as Hamming's paper of 1950. The problem has been explored by D. Slepian, G. Sacks, W. Peterson, E. J. McCluskey, R. C. Bose and R. R. Kuebler, Jr.,* and many others.

R. C. Bose and D. K. R. Chaudhuri (see footnote, page 460) have shown that the necessary and sufficient condition for the existence of a $\delta$-error correcting (n,k) binary group code is the existence of an n by n − k binary matrix of rank n − k such that any sets of its $2\delta$ rows are mutually independent. This ensures the feasibility of selecting all n sequences of weight less than or equal to $\delta$ as coset leaders. (For a full discussion see "Error Correcting Codes" by W. W. Peterson.)

* Ann. Math. Statistics, vol. 31, no. 1, pp. 113-134, March, 1960.

The following method for the selection of a suitable k by n − k parity-check matrix [A'] and a corresponding maximum-likelihood detection scheme appears in Slepian (II). Integers n and k are chosen so that

.3 

k > 2"-^ - ^ ~ (13-30) 

1. The first row of [A'] consists of n − k 1's. The second row will be a binary n − k sequence containing one 0 only. The succeeding rows are arbitrarily selected among all possible remaining distinct n − k sequences with only one 0. Then we start using n − k sequences with only two 0's and so on. This procedure is continued until k rows are obtained. The k × (n − k) matrix thus obtained may be used as a permissible parity-check matrix as described earlier (see also Sec. 4-14).

2. Suppose that a word v is received:

$$v = \alpha_1, \alpha_2, \ldots , \alpha_k;\ \beta_1, \beta_2, \ldots , \beta_{n-k}$$

Calculate

$$f_j = \beta_j \oplus \sum_{i=1}^{k} \alpha_i a_{ij} \qquad j = 1, 2, \ldots , n - k \tag{13-31}$$

where $a_{ij}$ denotes the elements of [A']. The sequence $f = f_1 f_2 \cdots f_{n-k}$ may or may not coincide with a row (say the rth row) of [A']. Then the following rules apply.

1. If the two are identical, then $\alpha_r$ is an erroneously received digit.

2. f is not a row of [A'] and does not contain exactly three 1's. The erroneously received digits are in the position where f has 1's.

3. f is not a row of [A'] but has exactly three 1's. Then, search for a row (say the ith row) of [A'] having exactly four 1's, three of them in the same positions as the 1's of f. If the 1 which is not in common is in the jth column, then $\alpha_i$ and $\beta_j$ were erroneously received.

The proof of the above, although straightforward, would require considerable time and effort. The method in essence is similar to the one described in Sec. 4-14.

The extension of the above systematic approach to the so-called Reed-Muller code* was also given by Slepian. This coding technique, which was discovered independently by Reed and Muller, in brief can be considered as a particular class of Hamming codes for which the length of the binary sequence is selected as

$$\begin{aligned}
& n = 2^m \\
& \text{Number of information digits} = k = \sum_{i=0}^{r} \binom{m}{i} \\
& \text{Number of check digits} = n - k \qquad m \text{ and } r \text{ positive integers},\ r < m
\end{aligned} \tag{13-32}$$

* See the bibliography at the end of the book.

The minimum distance between every two code words is $2^{m-r}$. Then up to $2^{m-r-1} - 1$ erroneous digits can be corrected. A systematic procedure for the determination of an $m \times 2^m$ matrix for parity checking and devising a maximum-likelihood detection scheme is given in Slepian (II).

Algebraic Operations of Group Codes. In a recent article, Slepian* examines the concepts of equivalence and algebraic operations of group codes. Two (n,k) codes are said to be equivalent if one can be obtained from the other by a fixed permutation of the places of every word. For example, the following three (3,2) group code alphabets are equivalent:

    000    000    000
    100    010    001
    011    101    110
    111    111    111

Two equivalent group codes have the same probability of error when all the words are transmitted with equal probability.

The $2^k$ words of an (n,k) group code can be obtained from any set of its k linearly independent words. For this reason, a set of k linearly independent words may be generally referred to as a generating matrix of the code. The relation between a k × n generating matrix $\Omega$ and an (n − k) × n binary matrix A of rank n − k is given by

$$\Omega A^{T} = 0$$

Thus, every parity-check matrix [A] of a group code (n,k) can be considered as a generating matrix for the dual code (n, n − k).
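The orthogonality relation between a generating matrix and a check matrix can be verified for the (6,3) code of Example 13-2. In the sketch below, which is an illustration only, the generating matrix is assumed to have the partitioned form $[I_k \mid M]$ and the full (n − k) × n check matrix the form $[M^T \mid I_{n-k}]$; the helper names are hypothetical.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def identity(size):
    return [[1 if i == j else 0 for j in range(size)] for i in range(size)]

def matmul_mod2(x, y):
    """Matrix product with modulo 2 arithmetic on every entry."""
    return [[sum(a * b for a, b in zip(row, col)) % 2 for col in zip(*y)]
            for row in x]

# Check rules of Example 13-2: each row gives one check digit in terms
# of the three information digits.
P = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 1]]
M = transpose(P)

omega = [i + m for i, m in zip(identity(3), M)]     # [I_k | M], 3 x 6
A_full = [p + i for p, i in zip(P, identity(3))]    # [M^T | I], 3 x 6

orthogonality = matmul_mod2(omega, transpose(A_full))
```

Every row of `omega` is a code word, and the product vanishes because each entry is the sum of the same digit taken twice.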

Two generating matrices $\Omega_1$ and $\Omega_2$ are said to be equivalent if they lead to equivalent group codes. The following results are due to D. Slepian.

Proposition 1. Every k × n binary generating matrix $\Omega$ is equivalent to the partitioned matrix

$$[I_k \mid M]$$

where $I_k$ is the k × k unit matrix and M is a k × (n − k) binary matrix.

Proposition 2. The necessary and sufficient condition for two binary k × n matrices $[\Omega_1]$ and $[\Omega_2]$ to generate equivalent (n,k) codes is that their columns can be placed into a one-to-one correspondence that preserves mod 2 addition of the columns.

* D. Slepian, Some Further Theory of Group Codes, Bell System Tech. October,



An example of proposition 1 for (5,3) follows:

$$\Omega_1 = \begin{bmatrix} 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 1 & 0 & 1 \end{bmatrix}$$

In order to obtain a generating matrix equivalent to $\Omega_1$, one can make any mod 2 linear combination of the columns of $\Omega_1$ which does not obviate the condition of one-to-one correspondence. For instance, we may consider

$$\Omega_2 = \begin{bmatrix} 1 & 0 & 1 & 1 & 1 \\ 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 & 0 \end{bmatrix}$$

A useful application of the concept of equivalent generating matrices can be given by defining algebraic operations on group codes. Let $\Omega_a$ and $\Omega_b$ be generating matrices for two group codes (n,k) and (n',k'), respectively. The generating matrix $\Omega_c$,

$$\Omega_c = \begin{bmatrix} \Omega_a & 0 \\ 0 & \Omega_b \end{bmatrix}$$

designates an (n + n', k + k') group code which will be denoted by $\mathcal{C}$ and referred to as the sum of the two group codes $\mathcal{A}$ and $\mathcal{B}$.

$$\mathcal{C} = \mathcal{A} + \mathcal{B} \tag{13-33a}$$

Each word from $\mathcal{C}$ is a sequence of a word from $\mathcal{A}$ and $\mathcal{B}$. Owing to the independence of the noise, it is clear that the probability of no error described in Eq. (13-28) for $\mathcal{C}$ is the product of the corresponding terms for $\mathcal{A}$ and $\mathcal{B}$.

When an (n,k) code $\mathcal{A}$ can be made equivalent to the sum of two (n,k) codes it is called a decomposable code. Otherwise the code is said to be indecomposable. A further interesting result of Slepian's is that "every (n,k) code $\mathcal{A}$ is equivalent to a sum of indecomposable codes":

$$\mathcal{A} = \mathcal{A}_1 + \mathcal{A}_2 + \cdots + \mathcal{A}_n$$


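The Kronecker product of two generating matrices $\Omega_a$ and $\Omega_b$ gives a
matrix $\Omega_c$ which can be used as a generating matrix for a new group code
$(nn', kk')$.

$$\Omega_c = \text{Kronecker product of } \Omega_a \text{ and } \Omega_b = \begin{bmatrix} a_{11}\Omega_b & a_{12}\Omega_b & \cdots & a_{1n}\Omega_b \\ a_{21}\Omega_b & a_{22}\Omega_b & \cdots & a_{2n}\Omega_b \\ \vdots & \vdots & & \vdots \\ a_{k1}\Omega_b & a_{k2}\Omega_b & \cdots & a_{kn}\Omega_b \end{bmatrix}$$

(where $[a_{ij}] = \Omega_a$).

The resulting code is referred to as the product of the codes $\mathcal{A}$ and $\mathcal{B}$.

$$\mathcal{C} = \mathcal{A} \cdot \mathcal{B} \tag{13-33b}$$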
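Both operations on generating matrices are easy to try out numerically. The sketch below assumes Python with NumPy; the function names `code_sum` and `code_product` and the two small generating matrices are illustrative choices, not taken from the text.

```python
import numpy as np

def code_sum(gen_a, gen_b):
    """Block-diagonal generating matrix of the sum of two group codes."""
    a = np.array(gen_a, dtype=np.uint8) % 2
    b = np.array(gen_b, dtype=np.uint8) % 2
    k, n = a.shape
    k2, n2 = b.shape
    out = np.zeros((k + k2, n + n2), dtype=np.uint8)
    out[:k, :n] = a      # upper-left block
    out[k:, n:] = b      # lower-right block
    return out

def code_product(gen_a, gen_b):
    """Kronecker product of two generating matrices, reduced mod 2."""
    a = np.array(gen_a, dtype=np.uint8) % 2
    b = np.array(gen_b, dtype=np.uint8) % 2
    return np.kron(a, b) % 2

# Illustrative inputs: a (2,1) repetition code and a (3,2) parity-check code.
gen_a = [[1, 1]]
gen_b = [[1, 0, 1],
         [0, 1, 1]]
print(code_sum(gen_a, gen_b).shape)      # (k + k', n + n')
print(code_product(gen_a, gen_b).shape)  # (k k', n n')
```

The shapes of the two results reproduce the parameters stated above: $(n + n', k + k')$ for the sum and $(nn', kk')$ for the product.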



this, one has to let

$$v = b = 4m - 1 \qquad r = k = 2m - 1 \qquad \lambda = m - 1 \tag{13-39}$$

As a further point of classical interest, the cited paper observes that the
statistical problem of balanced incomplete block design was in turn shown
by J. A. Todd to be analogous to certain well-known problems on the
existence of Hadamard matrices of certain order. Thus certain types of
problems of bounds on $A(n,d)$ can be equivalently translated into a
problem on Hadamard matrices. This is indeed an interesting basic
observation which suggests some strong links between certain coding
problems and some established parts of classical mathematics.

A Hadamard matrix is a square matrix of order $\alpha$ whose elements are
+1 or -1 and any two rows (consequently any two columns) are orthogonal.
It was shown by Paley* in 1933 that, besides the trivial case of
$\alpha = 2$, the only Hadamard matrices are those for which $\alpha$ is a multiple
of 4. The result that the problem of balanced incomplete block design is
equivalent to the existence of a Hadamard matrix is due to Todd.† He
has shown that the two problems are equivalent when

$$v = b = 4m - 1 \qquad r = k = 2m - 1 \qquad \lambda = m - 1 \qquad \alpha = 4m \tag{13-40}$$


For methods of constructing Hadamard matrices the reader is referred to 
the references cited in Bose and Shrikhande. 
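The correspondence between a Hadamard matrix of order $4m$ and a block design with $v = b = 4m - 1$, $r = k = 2m - 1$, $\lambda = m - 1$ can be checked on a small case. The sketch below assumes Python with NumPy and uses Sylvester's doubling construction, which is one standard method and not necessarily the one given in the cited references; all function names are illustrative.

```python
import numpy as np

def sylvester_hadamard(order):
    """Hadamard matrix whose order is a power of 2 (Sylvester doubling)."""
    if order < 1 or order & (order - 1):
        raise ValueError("order must be a power of 2")
    h = np.array([[1]])
    while h.shape[0] < order:
        h = np.block([[h, h], [h, -h]])
    return h

def is_hadamard(h):
    """Entries are +1 or -1 and distinct rows are orthogonal."""
    n = h.shape[0]
    return bool(h.shape == (n, n)
                and np.all(np.abs(h) == 1)
                and np.array_equal(h @ h.T, n * np.eye(n, dtype=int)))

def design_from_hadamard(h):
    """Incidence matrix of a block design obtained from a normalized matrix.

    Assumes the first row and first column of h are all +1 (true for the
    Sylvester matrices). That row and column are deleted and every
    remaining +1 is read as "point lies in block".
    """
    return (h[1:, 1:] == 1).astype(int)

h8 = sylvester_hadamard(8)           # order 4m with m = 2
incidence = design_from_hadamard(h8)
print(is_hadamard(h8))
print(incidence.sum(axis=1))         # each row should contain 2m - 1 ones
print(incidence @ incidence.T)       # off-diagonal entries should equal m - 1
```

For $m = 2$ the design has seven points and seven blocks of three points each, any two blocks sharing exactly one point.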

An extension of Hamming's results to p-nary codes has also appeared in
the literature (see, for instance, Shapiro and Slotnick, loc. cit.). Let $p$
denote any prime number and $d$ the distance between two points (the
number of coordinates in which they differ); then, the maximum number
of points (n-tuples of symbols) with minimum mutual distance $d$ will be
denoted by $A^{(p)}(n,d)$. The Hamming relations can be generalized in a
straightforward manner. For instance, the following inequality is valid:

$$A^{(p)}(n, 2e + 1) \le \frac{p^n}{\sum_{i=0}^{e} \binom{n}{i} (p - 1)^i} \tag{13-41}$$
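One standard generalization of Hamming's relation to an alphabet of $p$ symbols is the sphere-packing bound: the spheres of radius $e$ around the code points are disjoint, so their number cannot exceed $p^n$ divided by the volume of one sphere. Assuming that this is the relation intended here, a minimal Python sketch (function names are illustrative) evaluates it:

```python
from math import comb

def sphere_volume(n, e, p=2):
    """Number of n-tuples over p symbols within distance e of a given point."""
    return sum(comb(n, i) * (p - 1) ** i for i in range(e + 1))

def hamming_bound(n, e, p=2):
    """Largest integer not exceeding p**n / sphere_volume(n, e, p)."""
    return p ** n // sphere_volume(n, e, p)

# Binary case n = 7, e = 1 and a ternary case n = 4, e = 1.
print(hamming_bound(7, 1, 2))
print(hamming_bound(4, 1, 3))
```

When the division is exact, as in both cases printed, the bound can only be met by a close-packed code.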

We close this introductory chapter on group coding by emphasizing
that an adequate coverage of coding theory requires time and space
beyond the scope of this book. The reader is referred to the broad
literature on coding theory, particularly forthcoming books on the subject.

Two of the most important areas which remain to be fully investigated 
are as follows: 

* R. E. A. C. Paley, On Orthogonal Matrices, J. Math. Phys., vol. 12, pp. 311-320, 
1933. 

† J. A. Todd, A Combinatorial Problem, J. Math. Phys., vol. 12, pp. 321-333, 1933.




1. The evaluation of sharp bounds on the probability of error for
different $(n,k)$ codes and their relations to the channel capacity

2. Extension of the existing coding techniques to any finite,
semicontinuous, or continuous channels with or without memory

While remarkable progress has been made (particularly during the 
past 5 years), much more integration is needed. “Our understanding of 
group alphabets is still fragmentary,” as was pointed out by D. Slepian. 


PROBLEMS* 


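13-1. Consider the following six matrices:

$$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \quad \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix} \quad \begin{bmatrix} -\tfrac12 & \tfrac{\sqrt3}{2} \\ \tfrac{\sqrt3}{2} & \tfrac12 \end{bmatrix}$$

$$\begin{bmatrix} -\tfrac12 & -\tfrac{\sqrt3}{2} \\ -\tfrac{\sqrt3}{2} & \tfrac12 \end{bmatrix} \quad \begin{bmatrix} -\tfrac12 & \tfrac{\sqrt3}{2} \\ -\tfrac{\sqrt3}{2} & -\tfrac12 \end{bmatrix} \quad \begin{bmatrix} -\tfrac12 & -\tfrac{\sqrt3}{2} \\ \tfrac{\sqrt3}{2} & -\tfrac12 \end{bmatrix}$$

Show that they form a group under ordinary matrix multiplication, the first matrix
being the identity element.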





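13-2. Same question as in Prob. 13-1.

$$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \quad \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \quad \begin{bmatrix} 1 & 0 \\ -1 & -1 \end{bmatrix} \quad \begin{bmatrix} 0 & 1 \\ -1 & -1 \end{bmatrix} \quad \begin{bmatrix} -1 & -1 \\ 1 & 0 \end{bmatrix} \quad \begin{bmatrix} -1 & -1 \\ 0 & 1 \end{bmatrix}$$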


13-3. Show the validity of the following identities for the words $a$, $b$, and $c$ which
are binary n sequences.

(a) If $\|a \cdot b\| = \|a \cdot c\| = \|a\|$, then $\|b \cdot c\| \ge \|a\|$.

(b) If $a \cdot b = a$ and $\|b\| = \|a\|$, then $a = b$.

(c) The necessary and sufficient condition for $\|a \oplus b\| = \|b\| - \|a\|$ is that $a \cdot b = a$.

13-4. Devise a single-error correcting group code and the associated coding scheme
for

(a) n = 7, k = 2

(b) n = 7, k = 3

(c) n = 8, k = 2

(d) n = 8, k = 4

In each case compute the error probability $Q_1$ from Table T-4.

13-5. Let the words of a single-error correcting $(n,k)$ group code be designated by
$U_1 = (0,0,0, \ldots ,0), U_2, U_3, \ldots, U_N$, where $N = 2^k$. Show that any word $U_j$
can be obtained as a linear combination of $k$ independent words ($U_1$ excluded), that is,

$$U_j = X_2 U_2 + X_3 U_3 + \cdots + X_{k+1} U_{k+1} \qquad j = 1, 2, \ldots, N$$

where $(X_2, X_3, \ldots, X_{k+1})$ are binary elements.

* For a supply of problems on coding, see a forthcoming monograph entitled “Error 
Correcting Codes” by W. W. Peterson (John Wiley & Sons, Inc., New York). 



APPENDIX 


ADDITIONAL NOTES AND TABLES 


During the past decade the field of information theory has grown by
leaps and bounds. There have been new mathematical contributions,
as well as a host of applications to physical and social problems. Coding
theory has already become a professional field. Applications to detection
problems, such as radar detection theory, embrace another professional
area. The areas of application to some aspects of linguistics, electronics,
computers, optics, psychology, and others seem to be following the same
pattern of specialization. The mathematical foundation of information
theory constitutes another specialized territory.

The aim of this book has been an introductory presentation of the 
essentials of information theory. We have attempted to emphasize the 
fundamentals that are indispensable for an understanding of the subject 
prior to its application. The areas of specialized applications are outside 
the scope of the present work. Fortunately, an increasing supply of
literature dealing with the application of information theory to many 
specialized fields is available, and more books on the general subject are 
forthcoming. It is hoped that the reader will be stimulated to a more 
inclusive study of the subject. 

The following diversified notes may be of additional interest. The 
notes are not necessarily complete or self-sustaining. They are inserted 
as a few examples of the multitude of topics available for further reading. 
For each note, adequate reference for further pursuit is provided. The 
bibliography at the end of the book also may prove helpful for this 
purpose. 

N-1. The Gambler with a Private Wire. J. L. Kelly, Jr.,* has sug- 
gested an interesting model which presents the problem of the rate of 
transmission of information in a different way. 

Consider the case of a gambler with a private wire who places bets on
the outcomes of a game of chance. We assume that the side information
which he receives has a probability $p$ of being true and of $q = 1 - p$ of being
false. Let the original capital of the gambler be $V_0$, and $V_n$ his capital
after the nth betting. Since the gambler is not certain that the side

* J. L. Kelly, Jr., A New Interpretation of Information Rate, Bell System Tech. J.,
vol. 35, pp. 917-926, 1956.




information is entirely reliable, he places only a fraction $e$ of his capital
on each bet. Thus, subsequent to $n$ bettings, assuming the independence
of the successive tips, his capital is

$$V_n = (1 + e)^w (1 - e)^l V_0 \tag{N1-1}$$

where $w$ is the number of times he won and $l = n - w$ the number of
times he lost. These numbers are, in general, values taken by two
random variables denoted by $W$ and $L$. According to the law of large
numbers (Sec. 7-8),

$$\lim_{n \to \infty} \frac{W}{n} = p \qquad \lim_{n \to \infty} \frac{L}{n} = q = 1 - p \tag{N1-2}$$

The problem with which the gambler is faced is the determination of $e$
leading to the maximum of the average exponential rate of growth of his
capital. That is, he wishes to maximize the value of

$$G = \lim_{n \to \infty} \frac{1}{n} \log \frac{V_n}{V_0} \tag{N1-3}$$

with respect to $e$, assuming a fixed original capital and specified $p$.
Taking logarithms to the base 2 and using Eqs. (N1-1) and (N1-2),

$$G = p \log (1 + e) + q \log (1 - e) \tag{N1-4}$$

The maximum of $G$ occurs when

$$\frac{p}{1 + e} - \frac{q}{1 - e} = 0 \qquad e = p - q \tag{N1-5}$$

Therefore,

$$\max G = p \log 2p + q \log 2q = 1 + p \log p + q \log q \tag{N1-6}$$
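The optimum can be checked numerically. The sketch below is a plain Python illustration with hypothetical function names; it evaluates the growth rate $G$ for a tip that is right with probability $p$ (taken strictly between 0 and 1 so that the logarithms exist), compares the value at $e = p - q$ with $1 + p \log p + q \log q$, and confirms by a coarse grid search that no other fraction on the grid does better.

```python
import math

def growth_rate(p, e):
    """G = p log2(1 + e) + q log2(1 - e) for a betting fraction 0 <= e < 1."""
    q = 1.0 - p
    return p * math.log2(1.0 + e) + q * math.log2(1.0 - e)

def bsc_capacity(p):
    """1 + p log2 p + q log2 q, valid for 0 < p < 1."""
    q = 1.0 - p
    return 1.0 + p * math.log2(p) + q * math.log2(q)

p = 0.75
best_fraction = p - (1.0 - p)              # e = p - q
grid = [i / 100 for i in range(1, 100)]    # candidate fractions 0.01 ... 0.99
grid_best = max(grid, key=lambda e: growth_rate(p, e))

print(best_fraction, grid_best)
print(growth_rate(p, best_fraction), bsc_capacity(p))
```

The two printed growth figures agree, which is the coincidence with channel capacity discussed next.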

Thus, under these rather natural hypotheses, the maximum possible 
average exponential gain of the gambler coincides with the numerical 
value of the channel capacity. If the channel were noiseless, the gambler 
would obviously risk all his capital at each betting. This is, of course, in 
agreement with Eq. (N1-1). Also, if he knew the value of p beforehand,
he would be able to use this knowledge to his advantage and bet all his 
capital (or none). But the reliability of the tip is not known to him. 

According to Kelly, here we have an example of a real-life situation 
where considerations similar to the concept of source, channel, rate of 
transinformation, and channel capacity are valid. In the above refer- 
ence, Kelly extends these results to more general cases of a gambler plac- 
ing bets on outcomes of several games of chance. The gambler receives 
independent tips on each game conditional on the result of another game. 





The situation is analogous to a discrete independent source driving a
discrete memoryless noisy channel.

In conclusion, our acquired knowledge of information theory, which was
based primarily on Shannon's communication model, can well be applied
to other mathematical models arising from real-life problems.

Bellman and Kalaba have successfully applied the theory of dynamic
programming to Kelly's model. They have also extended the problem
to a type of problem which could be referred to as stochastic learning
processes. In all these generalizations, the problem envisaged is the
determination of some optimal policy for the gambler such that the
expected value of the logarithm of his capital after n bettings is
maximized. Since the constraints of the problem are of the type of linear
inequalities, the maximization procedure can manifestly be done by
Bellman's dynamic programming techniques. The reader interested in
applications of the theory of dynamic programming to the study of
communication processes may be interested in the articles by the following
authors, as well as a number of references given there: R. Bellman and
R. Kalaba, H. Robbins, and M. B. Marcus.

N-2. Some Remarks on Sampling Theorem. The literature available
on sampling theorems is quite extensive. Because of lack of space, only
a few of the existing references are cited here.

1. Reconstruction of a band-limited function $(-w, +w)$ from its
sampled derivatives has been done by L. J. Fogel,* D. L. Jagerman,* F. K.
Bond and C. R. Cahn,* and D. A. Linden and N. M. Abramson.* The
latter authors show that for a continuous band-limited function one has


x(t)= ^ 

fc = — « 


xikr) + (t — /cr)x'(fcT) + - — x"{kT) 


+ 


+ 


(<- 

R 


k is an integer (N2-1) 


2. As has been pointed out by several authors, the sampling intervals 
need not be uniformly distributed (see, for instance, J. L. Yen,† who
discusses some special nonuniform sampling). 

* L. J. Fogel, A Note on the Sampling Theorem, IRE Trans. on Inform. Theory,
vol. IT-1, pp. 47-48, 1955; D. L. Jagerman and L. J. Fogel, Some General Aspects of
the Sampling Theorem, IRE Trans. on Inform. Theory, vol. IT-2, pp. 139-146, 1956;
D. A. Linden and N. M. Abramson, A Generalization of the Sampling Theorem,
Inform. and Control, vol. 3, no. 1, pp. 26-31, 1960; F. K. Bond and C. R. Cahn, On
Sampling the Zeros of Bandwidth Limited Signals, IRE Trans. on Inform. Theory,
vol. IT-4, pp. 110-113, September, 1958.

† J. L. Yen, On Nonuniform Sampling of Bandwidth-limited Signals, IRE Trans.
on Circuit Theory, vol. CT-3, pp. 251-257, December, 1956.





3. A. V. Balakrishnan* has generalized the sampling theorem to the
case of a continuous-parameter stochastic process. The main result of
his paper is as follows:

Theorem. Let $\{x(t)\}$ be a real- or complex-valued stochastic process,
stationary of second order, possessing a spectral density which vanishes
outside the interval $[-2\pi W, 2\pi W]$. Then $\{x(t)\}$ has the following
representation:

$$x(t) = \operatorname*{l.i.m.}_{N \to \infty} \sum_{n=-N}^{N} x\!\left(\frac{n}{2W}\right) \frac{\sin \pi (2Wt - n)}{\pi (2Wt - n)}$$

A method of proof consists in applying the sampling function for
nonrandom functions to the covariance function of the process.
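The series in the theorem has the same form as the ordinary sampling series for a nonrandom band-limited function, and that deterministic case is easy to try numerically. The following sketch assumes Python with NumPy; the function name, the test signal, and the truncation length are illustrative assumptions, and a finite sum only approximates the limit.

```python
import numpy as np

def sinc_reconstruct(samples, indices, t, bandwidth):
    """Truncated sampling series: sum over n of x(n / 2W) sinc(2W t - n).

    np.sinc(u) is sin(pi u) / (pi u), which is exactly the kernel above.
    """
    t = np.asarray(t, dtype=float)
    kernel = np.sinc(2.0 * bandwidth * t[:, None] - indices[None, :])
    return kernel @ samples

W = 4.0                                       # band limit in cycles per unit time

def signal(t):
    # Squared sinc: its spectrum is a triangle confined to |f| <= W.
    return np.sinc(W * t) ** 2

n = np.arange(-200, 201)                      # sample indices kept in the sum
samples = signal(n / (2.0 * W))               # samples taken every 1 / (2W)
t = np.linspace(-1.0, 1.0, 41) + 0.013        # points between the sample instants
error = np.max(np.abs(sinc_reconstruct(samples, n, t, W) - signal(t)))
print(error)
```

The test signal decays quickly, so a few hundred samples already give a small error; a slowly decaying signal would need many more terms.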

4. Sampling theorem in n-dimensional space: The suggested proof for
the sampling theorem can be directly generalized to a band-limited
function defined in an n-dimensional space (n positive integer). To this end,
one usually employs the n-dimensional Fourier integral pairs. Let
$f(x_1, x_2, \ldots, x_n)$ be a function of n real variables; the Fourier transform
of $f(x_1, x_2, \ldots, x_n)$ is defined by

$$F(y_1, y_2, \ldots, y_n) = \underbrace{\int \cdots \int}_{n} f(x_1, x_2, \ldots, x_n) \exp\,[-i(x_1 y_1 + x_2 y_2 + \cdots + x_n y_n)]\, dx_1 \cdots dx_n \tag{N2-2}$$

when this integral exists. The inverse Fourier transform is defined by

$$f(x_1, x_2, \ldots, x_n) = (2\pi)^{-n} \underbrace{\int \cdots \int}_{n} F(y_1, y_2, \ldots, y_n) \exp\,[+i(x_1 y_1 + x_2 y_2 + \cdots + x_n y_n)]\, dy_1 \cdots dy_n \tag{N2-3}$$

In the light of this definition, one may state a generalized form of the
sampling theorem in n-dimensional space.

Theorem. Let $f(t_1, t_2, \ldots, t_n)$ be a function of n real variables, whose
n-dimensional Fourier integral $F(y_1, y_2, \ldots, y_n)$ exists and is identically zero outside an
n-dimensional rectangle symmetrical about the origin, that is,

$$F(y_1, y_2, \ldots, y_n) = 0 \qquad \text{for } |y_k| > |\omega_k| \qquad k = 1, 2, \ldots, n$$

Then

$$f(t_1, t_2, \ldots, t_n) = \sum_{m_1=-\infty}^{+\infty} \cdots \sum_{m_n=-\infty}^{+\infty} f\!\left(\frac{\pi m_1}{\omega_1}, \ldots, \frac{\pi m_n}{\omega_n}\right) \frac{\sin \omega_1 (t_1 - \pi m_1/\omega_1)}{\omega_1 (t_1 - \pi m_1/\omega_1)} \cdots \frac{\sin \omega_n (t_n - \pi m_n/\omega_n)}{\omega_n (t_n - \pi m_n/\omega_n)} \tag{N2-4}$$


* A. V. Balakrishnan, A Note on the Sampling Principle for Continuous Signals,
IRE Trans. on Inform. Theory, vol. IT-3, pp. 143-146, 1957.





The proof of this theorem follows directly from the proof of the sampling
theorem in the time domain. For details, see the report by E. Parzen.*
Other generalizations of sampling theorems have been obtained by A.
Kohlenberg (J. Appl. Phys., vol. 24, 1953) and D. Gabor (London
Symposium on Information Theory, 1960).

5. According to Kolmogorov (I), the sampling theorem was also used in
communication problems by the Russian scientist Kotel'nikov.†

6. An interesting informal derivation of the sampling theorem, along
with several other results, has been given by the late Balth. van der Pol
(Ann. Computation Lab., Harvard Univ., vol. 29, pt. I, pp. 3-25, 1959).

Let $p(x)$ be an entire function with simple roots at $\{\ldots, a_1, a_2, \ldots\}$
and $f(x)$ a band-limited time function with a cutoff angular frequency of $\pi$;
then, assuming that $f(x)$ and $p(x)$ have no common root, in an informal
manner, Van der Pol writes


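$$\frac{f(x)}{p(x)} = \sum_k \frac{f(a_k)}{p'(a_k)(x - a_k)} \qquad k \text{ integer} \tag{N2-5}$$

If $p(x)$ is selected to be equal to $\sin \pi x$, that is,

$$\{a_k\} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$$

one finds

$$\frac{f(x)}{\sin \pi x} = \sum_{k=-\infty}^{\infty} \frac{f(k)}{\pi \cos \pi k} \, \frac{1}{x - k}$$

$$f(x) = \sum_{k=-\infty}^{\infty} f(k) \, \frac{\sin \pi (x - k)}{\pi (x - k)}$$

Several interesting classical results could also be derived from the
sampling theorem. For instance, if $f(x)$ is taken to be unity, we find

$$1 = \sum_{k=-\infty}^{\infty} \frac{\sin \pi (t - k)}{\pi (t - k)}$$

or

$$\frac{1}{\sin \pi t} = \frac{1}{\pi} \sum_{k=-\infty}^{\infty} \frac{(-1)^k}{t - k} \tag{N2-6}$$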


N-3. Analytic Signals and the Uncertainty Relation. While this topic 
is of basic interest in the communication sciences, adequate space for its 


* E. Parzen, "A Simple Proof and Some Extensions of the Sampling Theorem,"
Stanford Univ. Tech. Rept. 7, Dept. of Statistics, December, 1956.

† V. A. Kotel'nikov, Material for the First All-Union Conference on Questions of
Communications, 1933.





presentation is not available here. Nonetheless, we wish to introduce 
the reader to the existence of the topic. D. Gabor* focused attention in 
this direction in 1946, followed by E. Wolf in 1947. 

Let $f(t)$ be a normalized function of the real variable $t$, in $L_2$, and
$F(\omega)$ its Fourier integral transform [see Eqs. (9-18)]. Then $|f(t)|^2$ and
$|F(\omega)|^2$ may be considered as probability density distributions for two
random variables, say $T$ and $\Omega$. The standard deviation of each of
these random variables in terms of their moments is

$$(\sigma_T)^2 = \int_{-\infty}^{\infty} (t - \bar{t})^2 |f(t)|^2 \, dt$$

$$(\sigma_\Omega)^2 = \int_{-\infty}^{\infty} (\omega - \bar{\omega})^2 |F(\omega)|^2 \, d\omega \tag{N3-1}$$

What is referred to as the uncertainty relation in quantum mechanics
implies that

$$\sigma_T \sigma_\Omega \ge \tfrac{1}{2} \tag{N3-2}$$

For the proof of this relation, which is mathematically straightforward,
the reader is referred to a text on quantum mechanics.

In the communication sciences we deal primarily with a real time
function $f(t)$, which in turn implies

$$|F(\omega)| = |F^*(-\omega)| \qquad \arg F(\omega) = -\arg F(-\omega) \qquad \bar{\omega} = 0$$


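When $f(t)$ is real, by applying the Fourier integral transform formula it
can be shown that the function $f(t)$ satisfying the equality sign in the
uncertainty relation is

$$f(t) = (2\pi\sigma_T^2)^{-1/4} \exp\left[-\frac{(t - \bar{t})^2}{4\sigma_T^2}\right]$$

$$F(\omega) = (2\pi\sigma_\Omega^2)^{-1/4} \exp\left(-\frac{\omega^2}{4\sigma_\Omega^2}\right) e^{-i\omega\bar{t}} \tag{N3-3}$$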
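That a Gaussian pulse attains the equality sign can be confirmed numerically. The sketch below assumes Python with NumPy and a unitary transform convention, $F(\omega) = (2\pi)^{-1/2} \int f(t) e^{-i\omega t}\,dt$, chosen here so that $|F(\omega)|^2$ integrates to one; the grid sizes and function names are illustrative.

```python
import numpy as np

def spread(x, density):
    """Standard deviation of a density tabulated on a uniform grid x."""
    dx = x[1] - x[0]
    w = density / (density.sum() * dx)           # renormalize on the grid
    mean = (x * w).sum() * dx
    return float(np.sqrt((((x - mean) ** 2) * w).sum() * dx))

def uncertainty_product(sigma_t=1.5):
    """sigma_T * sigma_Omega for a Gaussian pulse centered at t = 0."""
    t = np.linspace(-10 * sigma_t, 10 * sigma_t, 1501)
    f = (2 * np.pi * sigma_t ** 2) ** -0.25 * np.exp(-t ** 2 / (4 * sigma_t ** 2))
    omega = np.linspace(-6 / sigma_t, 6 / sigma_t, 801)
    dt = t[1] - t[0]
    # Riemann-sum approximation of the Fourier integral at each omega.
    F = np.exp(-1j * np.outer(omega, t)) @ f * dt / np.sqrt(2 * np.pi)
    return spread(t, np.abs(f) ** 2) * spread(omega, np.abs(F) ** 2)

print(uncertainty_product())
```

The printed product is close to one-half whatever value of `sigma_t` is chosen, because the frequency spread of the Gaussian is inversely proportional to its time spread.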


From a communication-theory point of view, D. Gabor finds it
convenient to introduce the concept of the analytic signal $f_+(t)$ which is a
complex function associated with a real time function $f(t)$:

$$f(t) = \sqrt{2}\, \operatorname{Re} f_+(t) \tag{N3-4}$$

* The concept of analytic signals was introduced in communication theory by Gabor
and Ville. For the definition and properties of analytic signals, see Gabor, Ville,
and Cherry: D. Gabor, Theory of Communications, J. Inst. Elec. Engrs. (London),
vol. 93, pt. III, pp. 429-457, November, 1946; J. Ville, Théorie et applications de la
notion de signal analytique, Câbles et transm., January, 1948, pp. 61-74; C. Cherry,
Quelques remarques sur le temps considéré comme variable complexe, Onde électrique,
vol. 34, pp. 7-13, January, 1954.





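The analytic signal has the interesting property that its Fourier
transform $F_+(\omega)$ is identically zero for $\omega < 0$, that is,*

$$f_+(t) = \frac{1}{\sqrt{\pi}} \int_0^\infty F(\omega) e^{i\omega t} \, d\omega$$

$$F_+(\omega) = \sqrt{2}\, F(\omega) \qquad \omega > 0$$

$$F_+(\omega) = 0 \qquad \omega < 0$$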
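A discrete-time analogue of this construction can be sketched with the fast Fourier transform. The code assumes Python with NumPy; the function name is hypothetical, and the treatment of the zero-frequency and Nyquist bins (weight $1/\sqrt{2}$) is a choice made here so that $f = \sqrt{2}\,\operatorname{Re} f_+$ holds exactly for sampled data, since the text does not discuss those bins.

```python
import numpy as np

def analytic_signal(x):
    """Discrete analogue of f_+: suppress negative frequencies of a real sequence.

    Positive-frequency bins are multiplied by sqrt(2) and negative ones by 0,
    so that x equals sqrt(2) * Re(result).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    weights = np.zeros(n)
    weights[0] = 1.0 / np.sqrt(2.0)              # zero-frequency bin
    if n % 2 == 0:
        weights[n // 2] = 1.0 / np.sqrt(2.0)     # Nyquist bin (even length only)
        weights[1:n // 2] = np.sqrt(2.0)         # strictly positive frequencies
    else:
        weights[1:(n + 1) // 2] = np.sqrt(2.0)
    return np.fft.ifft(np.fft.fft(x) * weights)

t = np.arange(64) / 64.0
x = np.cos(2 * np.pi * 5 * t) + 0.3
z = analytic_signal(x)
print(np.max(np.abs(np.sqrt(2.0) * z.real - x)))   # reconstruction error
print(np.max(np.abs(np.fft.fft(z)[33:])))          # residue at negative frequencies
```

Both printed quantities are at the level of rounding error.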

Now the previously discussed probability distributions lend themselves
to further simplification. In fact, for the new random variables $T_+$ and $\Omega_+$,

$$\sigma_{T_+} \sigma_{\Omega_+} \ge \tfrac{1}{2} \tag{N3-5}$$

E. Wolf has pointed out that, if $\sigma_{T_+}$ exists, then $\sigma_{T_+} = \sigma_T$. Conversely,
the existence of $\sigma_T$ implies the existence of $\sigma_{T_+}$, if $F(0) = 0$. Subject to
this requirement, the uncertainty relation becomes

$$\sigma_T \sigma_{\Omega_+} \ge \tfrac{1}{2} \tag{N3-6}$$

There are a large number of articles available on this subject in the
engineering as well as the mathematical literature. To mention one,
the reader is referred to Silverman and Kay. In this article, they derive
the following interesting inequality:

> *211 “ 2lF(0)|2Q+l ^ ~ (0)1^111 (N:i-7) 

which reduces to the familiar uncertainty relation when F(0) = 0. 

Similar results but from a slightly different point of view have also
appeared in the literature. See, for instance, D. G. Lampard.†

A function $f(t)$ which is not identically zero cannot be band-limited in
the time and frequency domains simultaneously. A proof of this
statement can be obtained directly from a classic theorem of Paley and
Wiener.‡ The theorem states:

Let $\phi(\omega)$ be a real nonnegative square-integrable function not
equivalent to zero in $[-\infty < \omega < +\infty]$. A necessary and sufficient condition
that there should exist a function $f(t)$ defined in $-\infty < t < \infty$ and
* The connection between analytic signals, network theory, Hilbert transforms, and
classical function theory is discussed in detail by Oswald and Zemanian; J. R. V.
Oswald, The Theory of Analytic Band-limited Signals Applied to Carrier Systems,
IRE Trans. on Circuit Theory, vol. CT-3, pp. 245-251, December, 1956; A. H.
Zemanian, Network Realizability in the Time Domain, IRE Trans. on Circuit
Theory, vol. CT-6, pp. 288-291, September, 1959.

† D. G. Lampard, Definitions of "Bandwidth" and "Time Duration" of Signals
Which Are Connected by an Identity, IRE Trans. on Circuit Theory, vol. CT-3,
pp. 286-288, December, 1956.

‡ R. E. A. C. Paley and N. Wiener, Fourier Transforms in the Complex Domain,
Am. Math. Soc. Colloq., vol. 19, p. 16, 1934.





identically zero for some range $t > t_0$, and such that

$$\phi(\omega) = |\text{Fourier integral of } f(t)| = |F(\omega)|$$

is that

$$\int_{-\infty}^{\infty} \frac{|\log \phi(\omega)|}{1 + \omega^2} \, d\omega < \infty \tag{N3-8}$$

It can be seen that, if $F(\omega)$ is identically zero over a finite range, over that
range the above integral is not finite. (An interesting alternative proof
based on the sampling theorem is given by R. E. Wernikoff in the
reference cited below.*)

N-4. Elias's Proof of the Fundamental Theorem for BSC. Elias has
suggested a proof for the fundamental theorem in the case of the two most
common types of discrete noisy channels without memory, BSC and BEC.
His method of proof is based on what is referred to as random block
coding. The messages are encoded in blocks each $n$ digits long. When a
word is received, the receiver lists the most probable word, that is, the
word in the code that differs in the least number of places from the
received sequence. If there is more than one such word, the decision will
necessarily be ambiguous. The plausibility of this method is, of course,
due to the fact that in a BSC with $p < \tfrac{1}{2}$ the probability of a number
of $k$ errors occurring in a word is a monotonically decreasing function
of $k$ (see Secs. 4-11 and 4-16).

Let each n-digit message contain $m$ information digits and $n - m$ parity
digits. Thus, the rate of transmission of information $R$ is $m/n$. We
assume without loss of generality that the number of messages to be
transmitted is

$$M = 2^m = 2^{nR}$$

and the parity check used allows correction of all sets of $k_1$ or fewer errors
in each n-digit block. If all the input block words are transmitted with
equal probability, the transinformation per symbol will be

$$R = \frac{1}{n} \log M \le \frac{1}{n} \log \left[ 2^n \Big/ \sum_{j=0}^{k_1} \binom{n}{j} \right] = 1 - \frac{1}{n} \log \sum_{j=0}^{k_1} \binom{n}{j} \tag{N4-1}$$

(See, for instance, the sphere-packing argument of Secs. 13-5 and 4-18.)
The equality can hold only under the most favorable circumstances of
lossless coding. The minimum ambiguity probability $Q_{opt}$ corresponding
to those received words that are not within the disjoint spheres of radii $k_1$

* R. E. Wernikoff, Time-limited and Band-limited Functions, Mass. Inst. Technol.,
Research Lab. Electronics, Quart. Progr. Rept., January, 1957, pp. 72-74.





centered at the transmitted words can be directly computed from the
tail of the binomial distribution:

$$Q_{opt} = \sum_{j=k_1+1}^{n} \binom{n}{j} p^j q^{n-j}$$
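Both quantities in this argument, the sphere-packing limit on the rate and the binomial tail giving the probability of more than $k_1$ errors in a block, are simple to evaluate. The sketch below is plain Python; the function names and the sample block length are illustrative assumptions.

```python
from math import comb, log2

def max_rate(n, k1):
    """Right-hand side of the sphere-packing relation: 1 - (1/n) log2 sum C(n, j)."""
    return 1.0 - log2(sum(comb(n, j) for j in range(k1 + 1))) / n

def tail_probability(n, k1, p):
    """Probability that more than k1 of n digits are in error on a BSC."""
    q = 1.0 - p
    return sum(comb(n, j) * p ** j * q ** (n - j) for j in range(k1 + 1, n + 1))

# Block length 7 with single-error correction on a channel with p = 0.1.
print(max_rate(7, 1))
print(tail_probability(7, 1, 0.1))
```

For $n = 7$, $k_1 = 1$ the rate limit equals 4/7, the value attained by the close-packed single-error correcting code of that length.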



Next, consider all possible such parity codes selected at random. That
is, we select at random $2^m$ n-symbol words from a total of $2^n$ possible
such words. Of course, since a word may be selected twice or more as a
code word for a given message, the ensemble of these codes may contain
some very poor codes. However, the average ambiguity of all these
codes cannot be less than the ambiguity $Q_b$ of the best of them. Elias
has proved that for a fixed $k_1$ and

$$p < p_1 = \frac{k_1}{n} < p_{crit} = \frac{\sqrt{p}}{\sqrt{p} + \sqrt{q}}$$

we may write 
Qb < Qav < 


/ n \ r pqi 

\npij [pi - p 


+ 


1 - iq/p)(pi/qiy 


> Qopt > 



Pi + i/n 


(N4^-2) 


(N4-3) 


Using Stirling's approximation, these inequalities can be rewritten as 

Qb < Q„ < [^- + 1 - (g/i(p,/g,)J 

g exp2 I -n + (pi - p) log ijj 

^ “ pf- p I -Q + + (Pi - P) log I j| (N4-4) 

Elias concludes that $Q_b$, the ambiguity probability of the best code,
exponentially depends on $n$. This result may be finally expressed in the
form

$$K_2 2^{-nB_2} < Q_b < K_1 2^{-nB_1} \tag{N4-5}$$

The probability of error, on an average, is bounded above and below by
exponential functions. The terms $B_1$ and $B_2$ are independent of $n$ and
depend solely on the channel parameters and the specified $R$. $K_1$ and $K_2$
are nearly constant terms. Thus, for a given channel and transmission
rate, the probability of error can be made arbitrarily small by increasing
the block length. Similar results hold for the case $p_1 > p_{crit}$. The
mathematical derivation of these formulas is somewhat complex. The
reader is referred to Elias's article in the Proceedings of the Third London
Symposium on Information Theory, 1955. A mathematically more





refined treatment of Elias's original exponential bounds is presented in
Feinstein (I).

Similar results for a BSC were also independently derived by several
other authors, for instance, G. A. Barnard (Third London Symposium on
Information Theory, pp. 96-102, 1955) and E. N. Gilbert (Bell Tel.
Labs. Tech. Mem., June, 1956). An alternative proof was also given by
J. M. Wozencraft (Mass. Inst. Technol., Research Lab. Electronics,
Quart. Progr. Rept., Jan. 15, 1958, pp. 90-95). Wozencraft's proof is
motivated by Shannon's method, as described in Chap. 12. He assumes
maximum-likelihood decoding procedure and computes accordingly the
probability of error. A bound for the probability of error is obtained by
using Shannon's method, which, in lieu of Elias's use of approximation by
Stirling's formula, applies Chernoff's inequality to mutual information.

A geometrical interpretation of the dependence of $B_1$ and $B_2$ on channel
probability $p$ is illustrated in Fig. N4-1. The curve represents the
capacity of a BSC for various values of $p$.

$$C = 1 + p \log p + q \log q = 1 - H(p) \tag{N4-6}$$

The equation of the tangent to the curve at a point with abscissa $r$ is

$$Y = C(r) + (p - r)[\log r - \log (1 - r)] \tag{N4-7}$$

Fig. N4-1. A graphical method for determination of the error-probability bounds.





The exponent $B_1$ is given by the difference of the ordinates of this curve
and the tangent at $r = p$ for any desired transmission rate when $p_1 < p_{crit}$.
This is illustrated in Fig. N4-1 for a specified transmission rate $R$. For
$p_1 < p_{crit}$, the exponent $B_2$ is equal to $B_1$, but for $p_1 > p_{crit}$ a slight
modification of the above procedure is in order. We note that in this range

$$B_2 = C(r) - [2C(r) + F(p_1) - 2C(p_1)] < B_1$$

Thus, for a geometrical interpretation one has to trace a new curve and
measure the difference of its ordinate with that of the capacity curve, as
illustrated in Fig. N4-1.*

Also, note that in region $p_1 < p_{crit}$ the two exponents are identical;
that is, the upper and the lower bounds are proportional. In that region
the ratio of the two bounds is proportional to $n$. In the range $p_1 > p_{crit}$,
the bounds diverge exponentially.
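The geometric construction for $B_1$ can be reproduced numerically. The sketch below is plain Python and rests on one reading of the construction, stated here as an assumption: the tangent to the capacity curve is drawn at the abscissa equal to the channel probability $p$, and the exponent is the gap between the curve and that tangent at the abscissa $p_1$. Under that reading the gap coincides with the quantity $p_1 \log (p_1/p) + q_1 \log (q_1/q)$, which the code verifies. All function names are illustrative.

```python
from math import log2

def capacity(p):
    """C(p) = 1 + p log2 p + q log2 q, for 0 < p < 1."""
    q = 1.0 - p
    return 1.0 + p * log2(p) + q * log2(q)

def tangent(x, r):
    """Ordinate at abscissa x of the tangent to the capacity curve drawn at r."""
    return capacity(r) + (x - r) * (log2(r) - log2(1.0 - r))

def exponent_b1(p, p1):
    """Gap between the capacity curve and the tangent drawn at p, measured at p1."""
    return capacity(p1) - tangent(p1, p)

def divergence(p1, p):
    """p1 log2(p1 / p) + q1 log2(q1 / q)."""
    return p1 * log2(p1 / p) + (1.0 - p1) * log2((1.0 - p1) / (1.0 - p))

p, p1 = 0.05, 0.10
print(exponent_b1(p, p1), divergence(p1, p))
```

The gap is zero at $p_1 = p$ and grows as $p_1$ moves away from $p$, in line with the convexity of the capacity curve.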

N-5. Further Remarks on Coding Theory

The Bose-Chaudhuri t-error Correcting Group Codes. An interesting
generalization of Hamming's and Slepian's work has been derived by
R. C. Bose and D. K. R. Chaudhuri.† They have derived the necessary
and sufficient conditions for a binary group code $(n,k)$ to be a t-error
correcting code. Their work contains material of theoretical as well as
practical interest. We shall quote some of their main results with little
(if any) indication of their method of derivation. For the proof the reader
is referred to the original article.

Theorem 1. The necessary and sufficient condition for the existence
of a t-error correcting $(n,k)$ binary group code is the existence of an
$n \times r$ matrix [A] with $r = n - k$ such that any $2t$ row vectors of A
are mutually independent.
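The condition of Theorem 1 can be tested by brute force on a small matrix. The sketch below is plain Python; each row of [A] is stored as an integer whose bits are the row entries, and the function names are illustrative. The example matrix, whose rows are the binary representations of 1 through 7, is the familiar parity-check arrangement of the single-error correcting code of length 7 and is supplied here only as a test case.

```python
from itertools import combinations

def gf2_rank(rows):
    """Rank over the binary field of rows given as integer bit masks."""
    basis = []
    for v in rows:
        # Reduce v against the basis; min() clears a leading bit when possible.
        for b in basis:
            v = min(v, v ^ b)
        if v:
            basis.append(v)
    return len(basis)

def satisfies_theorem_1(parity_rows, t):
    """True if every set of 2t rows is linearly independent mod 2."""
    return all(gf2_rank(subset) == 2 * t
               for subset in combinations(parity_rows, 2 * t))

rows = list(range(1, 8))      # n = 7 rows, r = 3 columns
print(satisfies_theorem_1(rows, 1))
print(satisfies_theorem_1(rows, 2))
```

With only three columns no four rows can be independent, so the same matrix cannot serve for double-error correction.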

Theorem 2. If $n = 2^m - 1$, there exists a t-error correcting binary
group code $(n,k)$ with

$$k \ge n - tm$$

The proof of Theorem 1 is based in part on Hamming's lemma, which
states: The necessary and sufficient condition for a binary t-error
correcting group code $(n,k)$ is that each word (except the null word) have a
norm of at least $2t + 1$. Theorem 2 is a sharper statement than the result
obtained earlier by Varsamov.‡ The latter author has shown that

* See Elias, cited above, and R. M. Fano, The Statistical Theory of Information,
Nuovo cimento, ser. X, vol. 13, suppl. 2, pp. 353-372.

† R. C. Bose and D. K. R. Chaudhuri, On a Class of Error Correcting Binary Group
Codes, Inform. and Control, vol. 3, no. 1, pp. 68-79, 1960.

‡ R. R. Varsamov, The Evaluation of Signals in Codes with Correction of Errors,
Doklady Akad. Nauk S.S.S.R., new series, vol. 117, pp. 739-741, 1957.





if $k$ satisfies the inequality

$$2^{n-k} > S_{2t-1}^{n-1}$$

where

$$S_i^n = 1 + \binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{i} \tag{N5-1}$$

then a t-error correcting binary group code $(n,k)$ exists.

The merit of the proof suggested by Bose and Chaudhuri lies
particularly in the fact that it provides a constructive method for the codes.
Also, the implementation of these codes does not seem to be too complex
(see, for instance, Peterson*). Peterson points out that these codes
have a cyclic property and hence can be implemented with what is called a
shift-register generator, as demonstrated earlier by Prange.† (The cyclic
property implies that, if a word $u = a_1, a_2, \ldots, a_n \in S$, the words
obtained by cyclically shifting digits of $u$ in some fashion are also members
of $S$, for example, $u_1 = a_n, a_1, a_2, \ldots, a_{n-1} \in S$. An early study
of shift-register generators can be found in a report by N. Zierler.‡)

Dependent Error Correction. In all error correcting schemes thus far
presented, the occurrence of error was assumed to be a statistically
independent phenomenon. In many data-processing systems the occurrence
of an error in a particular binary digit is conditional on the occurrence
of error in the preceding digits. Several interesting procedures for
the detection and correction of interdependent errors have appeared in
the literature for special error patterns, although a general solution has
not yet been devised. An interesting class of these codes has been
investigated by N. M. Abramson, P. Fire, and D. W. Hagelbarger. The latter
author devises codes for correction and detection of a "burst" of errors
(for example, when lightning may knock out several adjacent telegraph
pulses). A brief discussion of Abramson's approach is presented below
without reference to the practically important problem of instrumentation.

Abramson has suggested a code which corrects single or double adjacent
error (SEC-DAEC). Let $m$ be the number of information digits and $k$
the number of parity checks; the number of distinct single and double
adjacent errors in a word with $n = m + k$ digits is $n + n = 2n$. The
parity-check number $k$ then must satisfy

$$2^k \ge 2(m + k) + 1$$

$$m \le 2^{k-1} - k - \tfrac{1}{2}$$

* W. W. Peterson, "Error Correcting Codes," John Wiley & Sons, Inc., New York,

† E. Prange, "Some Cyclic Error-correcting Codes with Simple Decoding
Algorithms," Air Force Cambridge Research Center AFCRC-TN-58-156, April, 1958.

‡ N. Zierler, Several Binary-sequence Generators, Mass. Inst. Technol., Lincoln
Lab. Tech. Rept. 95, September, 1955.




When the number of information digits satisfies the equality

$$m = m_0 = 2^{k-1} - k - 1$$

the code is referred to as a complete code. The following values are
given in Abramson.

k      4    5    6    7    8    9   10

m_0    3   10   25   56  119  246  501
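The tabulated values follow directly from the counting argument: $2n + 1$ distinct conditions (no error, $n$ single errors, and $n$ double adjacent errors as counted above) must be distinguished by $k$ parity checks. A short Python sketch, with an illustrative function name, regenerates the table and confirms that each $m_0$ is the largest number of information digits compatible with the inequality.

```python
def complete_code_size(k):
    """Largest m with 2**k >= 2 * (m + k) + 1, namely 2**(k - 1) - k - 1."""
    return 2 ** (k - 1) - k - 1

table = {k: complete_code_size(k) for k in range(4, 11)}
for k, m0 in table.items():
    fits = 2 ** k >= 2 * (m0 + k) + 1             # m0 satisfies the inequality
    too_many = 2 ** k >= 2 * (m0 + 1 + k) + 1     # m0 + 1 should not
    print(k, m0, fits, too_many)
```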

To devise a complete SEC-DAEC, one has to set up a set of $2n + 1$ binary
equations whose solution determines the single or the double adjacent-error
position. A systematic method for setting up these equations and an
instrumentation technique based on the use of shift register are given in
the above cited reference.

Convolution Codes. Most of the work available on coding theoryus 
related, in one way or another, to systematic block codes. A coi\i- 
pletely different type of coding was suggested by J. M. Wozencraft in 
1957. In these codes, the information message is a single sequence of 
binary digits. The parity checks assume some specific pattern; for exam- 
ple, they may be interlaced by information digits. Each check digit is 
determined by the preceding digits through a checking equation. Of 
course, theoretically, for a long message, it seems that one has to take into 
consideration the effect of all transmitted digits. But, for all practical 
purposes, as far as the error probability is concerned, one may confine 
oneself to a suitable number of immediately preceding digits. A most 
significant property of convolution codes is the fact that in a certain sense
the “average” amount of computation per digit for encoding-decoding,
hence the equipment needed for implementation of codes, is quite realistic.
The encoding-decoding procedure for convolution codes and the computa- 
tion of error probability have been accomplished in the past 3 years. A 
monograph giving a full description of these codes is in preparation by 
J. M. Wozencraft of Massachusetts Institute of Technology. (See also 
M. A. Epstein.) 
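As a rough illustration of the idea, and not Wozencraft's actual construction, the sketch below interlaces each information digit with one check digit formed as the modulo-2 sum of a few of the most recent information digits. The tap pattern `TAPS` and the function names are assumptions chosen only for this example.

```python
from typing import List, Sequence

# Hypothetical checking pattern: the check digit is the modulo-2 sum of the
# current information digit and the ones 1 and 3 positions earlier.
TAPS = (0, 1, 3)


def encode(info: Sequence[int], taps: Sequence[int] = TAPS) -> List[int]:
    """Interlace each information digit with a check digit on preceding digits."""
    out: List[int] = []
    for i, bit in enumerate(info):
        check = 0
        for t in taps:
            if i - t >= 0:          # digits before the start are taken as 0
                check ^= info[i - t]
        out.extend([bit, check])    # information digit, then its check digit
    return out


def checks_hold(word: Sequence[int], taps: Sequence[int] = TAPS) -> bool:
    """Recompute every check digit from the received information digits."""
    word = list(word)
    return word == encode(word[0::2], taps)


if __name__ == "__main__":
    sent = encode([1, 0, 1, 1, 0])
    print(sent)                # [1, 1, 0, 1, 1, 1, 1, 1, 0, 1]
    print(checks_hold(sent))   # True
```

Because each check digit looks back only a fixed number of positions, the work per digit stays constant however long the message is, which is the property emphasized in the text.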

N-6. Partial Ordering of Channels. A recent contribution of C. E. 
Shannon* has provided a basis of comparison for some communication 
channels through their stochastic matrices. While it is early to speculate
on possible applications of Shannon's original idea, it appears that it will 
encompass some important areas of investigation. Algebraic operations 
on stochastic matrices and their physical interpretation seem to indicate 
an area where information theory and systems theory considerations could 
join forces. 

Consider two discrete memoryless channels with m input and m output 

* C. E. Shannon, “A Note on a Partial Ordering for Communication Channels,”
Information and Control, vol. 1, pp. 390-397, Academic Press, Inc., New York, 1958.




symbols. The cascading of these channels is equivalent to a channel 
whose stochastic matrix K is the product of the two corresponding
stochastic matrices $K_1$ and $K_2$. In the following, for simplicity, we consider
first the cascading of such channels. The results can be generalized
in a direct fashion. 

A channel $K_1$ is said to include another channel $K_2$ if there exists a
channel K such that $K_2$ can be obtained by cascading $K_1$ and K. The
stochastic matrix of $K_2$ is the product of the stochastic matrices of $K_1$
and K. The channel-inclusion relation is denoted by $K_1 \supseteq K_2$. If for
two channels neither $K_1 \supseteq K_2$ nor $K_1 \subseteq K_2$, then the two channels are
mutually exclusive or not comparable. We also say that such channels
have no partial ordering. The following properties may be derived:

1. Transitive property. If $K_1 \supseteq K_2$ and $K_2 \supseteq K_3$, then

$$K_1 \supseteq K_3$$

2. Multiplication. If

$$K_1 \supseteq K_3$$
$$K_2 \supseteq K_4$$

then

$$K_1 K_2 \supseteq K_3 K_4$$

3. Convexity. Let $K_1$, $K_2$, and $K_3$ be the stochastic matrices of three
channels such that

$$K_1 \supseteq K_2$$
$$K_1 \supseteq K_3$$

and K the stochastic matrix of a new channel where

$$K = \lambda K_2 + (1 - \lambda) K_3$$
$$0 < \lambda < 1$$

Then $K_1 \supseteq K$
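The cascade and convexity statements can be checked on a small example. The sketch below assumes the simple form of inclusion described above (the included channel is the first channel cascaded with some other channel), stores stochastic matrices as lists of rows of exact fractions, and uses illustrative matrices `K1`, `A`, and `B` that are not taken from Shannon's paper.

```python
from fractions import Fraction as F


def cascade(P, Q):
    """Stochastic matrix of channel P followed by channel Q (matrix product)."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


def is_stochastic(P):
    """Every entry nonnegative and every row summing to one."""
    return all(all(x >= 0 for x in row) and sum(row) == 1 for row in P)


def mix(lam, P, Q):
    """Convex combination lam*P + (1 - lam)*Q, entry by entry."""
    return [[lam * p + (1 - lam) * q for p, q in zip(rp, rq)]
            for rp, rq in zip(P, Q)]


if __name__ == "__main__":
    K1 = [[F(9, 10), F(1, 10)], [F(1, 5), F(4, 5)]]
    A = [[F(3, 4), F(1, 4)], [F(1, 4), F(3, 4)]]
    B = [[F(1), F(0)], [F(1, 2), F(1, 2)]]
    K2 = cascade(K1, A)          # included in K1 by construction
    K3 = cascade(K1, B)          # included in K1 by construction
    lam = F(1, 3)
    K = mix(lam, K2, K3)
    # K is K1 cascaded with the single channel lam*A + (1 - lam)*B,
    # so K1 includes K as the convexity property states.
    print(K == cascade(K1, mix(lam, A, B)), is_stochastic(mix(lam, A, B)))
```

The same linearity argument shows why the convexity property holds for any weight between 0 and 1 under this definition.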

The above definition and properties can next be generalized for defining
a partial ordering of two finite channels $K_1$ and $K_2$ (not necessarily
m × m). We say that $K_1 \supseteq K_2$ if $K_1$ could be derived from $K_2$ by a pre
and a post cascading channel, that is,

$$K_1 = A K_2 B \qquad \text{(N6-1)}$$

From a physical point of view, as Shannon has described, partial ordering
means roughly that some sort of operation is applied to a channel $K_2$ in
order that it look like $K_1$ (for instance, cascading of A, $K_2$, and B).
As an exercise, the reader may wish to derive the necessary and sufficient
conditions for two given binary channels to have a partial ordering.
Shannon has derived an interesting theorem on two channels with a
partial-ordering relation, namely,

Theorem. Let

1. $(m_1, m_2, \ldots, m_N)$ be a set of n-sequence words transmitted with
specified probabilities $(p_1, p_2, \ldots, p_N)$ over a discrete memoryless
channel $K_1$.





2. $p_{e1}$ be the probability of error for this channel under some specified
decoding scheme.

3.

$$K_1 \supseteq K_2$$

Then there exists a set of N n-sequence messages and a decoding scheme
for $K_2$ such that, if messages are transmitted with the same input
probabilities as before, the error probability ($p_{e2}$) will not increase, that is,

$$p_{e2} \le p_{e1}$$

For a mathematical proof of this intuitively plausible statement, see
Shannon (loc. cit.).

N-7. Information Theory and Radar Problems. The central problem
in this area is the detection of radar signals of known characteristics in
the presence of noise. For instance, if a signal x(t) is transmitted and
y(t) is received, assuming some additive noise, we have

$$y(t) = x(t - \tau) + \text{noise}$$

A first problem is the evaluation of delay time $\tau$ and the radar range by
comparison of x and y. Generally the noise is taken to be gaussian and
x and y as random variables with specified characteristics. Thus,
subsequent to certain plausible assumptions, one is able to find some type of
probability distribution for the delay and the range.
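One concrete set of "plausible assumptions" can be sketched in a few lines. The example below is only an illustration, not the method of any of the references: it assumes discrete time, a known pulse `x`, white gaussian noise of known standard deviation `sigma`, and a uniform prior over a finite set of delays. Under those assumptions it returns a probability distribution for the delay, whose entropy measures the uncertainty that remains.

```python
import math
from typing import List, Sequence


def shifted(x: Sequence[float], delay: int, n: int) -> List[float]:
    """Return x delayed by `delay` samples inside a window of n samples."""
    out = [0.0] * n
    for i, v in enumerate(x):
        out[delay + i] = v
    return out


def delay_posterior(x: Sequence[float], y: Sequence[float],
                    sigma: float) -> List[float]:
    """Posterior probability of each delay 0, 1, ..., len(y) - len(x)."""
    n = len(y)
    log_like = []
    for d in range(n - len(x) + 1):
        s = shifted(x, d, n)
        sq_err = sum((yi - si) ** 2 for yi, si in zip(y, s))
        log_like.append(-sq_err / (2.0 * sigma ** 2))
    top = max(log_like)                     # subtract the maximum for stability
    weights = [math.exp(v - top) for v in log_like]
    total = sum(weights)
    return [w / total for w in weights]


def entropy_bits(p: Sequence[float]) -> float:
    """Entropy, in bits, of a discrete distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0.0)


if __name__ == "__main__":
    x = [0.0, 1.0, 2.0, 1.0, 0.0]           # transmitted pulse (illustrative)
    clean = shifted(x, 3, 12)                # true delay of 3 samples
    # Small fixed disturbance standing in for one noise realization.
    y = [c + 0.05 * (-1) ** i for i, c in enumerate(clean)]
    post = delay_posterior(x, y, sigma=0.5)
    print(post.index(max(post)), entropy_bits(post))
```

With eight candidate delays the entropy can be at most 3 bits; a sharply peaked distribution gives a value close to zero.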

The role of information theory, although perhaps conceptually
enlightening from a procedural point of view, is a secondary one. The most
complex part of the problem is in formulating and solving an input-output
type of probability problem for some linear or nonlinear system.
When the problem is solved, that is, a probability distribution function
for the unknown is derived, the entropy associated with that distribution
reveals a certain measure of uncertainty about the searched quantity.
This problem falls in the general field of statistical extraction of signal
from noisy background and design of optimum filters. Thus it appears
that, while the problem occupies an important place in the statistical
theory of communication, it is not immediately related to our subject.
Furthermore, several sources with adequate coverage of this subject are
available. For the benefit of the reader the following list of references is
included. A treatment of the now classic work of Woodward and Davis
on radar problems appears in P. M. Woodward (Chaps. 5-7).

For a concise presentation of radar detection theory based on the
maximum-likelihood technique, see W. B. Davenport, Jr., and W. L.
Root (Chap. 14).

For a comprehensive theoretical study of the subject, including the 
work of Middleton and Van Meter, and a list of the most recent con- 
tributions, see D. Middleton. 





Table T-1. Normal Probability Integral 

$$\phi(z) = \frac{1}{\sqrt{2\pi}} \int_0^z e^{-t^2/2} \, dt$$


2 

^(*) 

2 

«/•(*> 1 

z 

I 


z 


0 00 

0 0000 

0 

05 

0 

2422 


30 

0 

4032 

1 

95 

0 

4744 

0 01 

0 0040 

0 

06 

0 

2454 


31 

0 

4049 

1 

90 

0 

47.50 

0-02 

0 0080 

0 

67 

0 

2480 


32 

0 

4000 

J 

97 

0 

47.50 

0 on 

0 0120 

0 

08 

0 

2517 


33 

0 

4082 

1 

98 

0 

1701 

0 04 

0 0100 

0 

09 

0 

2549 


34 

0 

4099 

1 

99 

0 

47(i7 

0 05 

0 0199 

0 

70 

0 

2580 


35 

0 

4115 

2 

00 

0 

4772 

0 00 

0 0239 

0 

71 

0 

2011 


30 

0 

41.11 

2 

02 

0 

178.3 

0 07 

0 0279 

0 

72 

0 

2042 


37 

0 

4147 

2 

01 

0 

1793 

0 08 

0.0319 

0 

73 

0 

2073 


38 

0 

4102 

2 

00 

0 

4803 

0 09 

0 0359 

0 

74 

0 

2703 


39 

0 

4177 

2 

08 

0 

4812 

0 10 

0 0398 

0 

75 

0 

2734 


40 

0 

1 1 92 

2 

10 

0 

1821 

0.11 

0 0438 

0 

70 

0 

2704 


.41 

0 

4207 

2 

12 

0 

4830 

0 12 

0 0478 

0 

77 

0 

2794 


42 

0 

4222 

2 

1 1 

0 

1838 

0 in 

0 0517 

0 

78 

0 

2823 


4.1 

0 

4 230 

2 

10 

0 

•ISKi 

0 14 

0 0557 

0 

79 

0 

2852 


41 

0 

4251 

2 

18 

0 

4851 

0 15 

0 0590 

0 

80 

0 

2881 


45 

0 

4205 

2 

20 

0 

)8i.l 

0 10 

0 0030 

0 

81 

0 

2910 


40 

0 

4279 

2 

22 

0 

181)8 

0 17 

0 0(;75 

0 

82 

0 

2939 


47 

0 

4292 

2 

24 

0 

1875 

0 IH 

0.0714 

0 

83 

0 

2907 


48 

0 

430(. 

2 

20 

0 

1881 

0 19 

0 0753 

0 

81 

0 

2995 


49 

0 

4319 

2 

28 

0 

1887 

0 20 

0 0793 

0 

85 

0 

3023 


50 

0 

4332 

2 

.30 

0 

1893 

0 21 

0 0832 

0 

80 

0 

3051 


51 

0 

1.315 

2 

32 

0 

1898 

0 22 

0 0871 

0 

87 

0 

3078 


52 

0 

4357 

2 

34 

0 

1901 

0 2n 

0 0910 

0 

88 

0 

3100 


53 

0 

4.170 

2 

30 

0 

1909 

0 24 

0 0948 

0 

89 

0 

3133 


54 

0 

4.382 

2 

38 

0 

1913 

0 25 

0 0987 

0 

90 

0 

3159 


5.5 

0 

439 1 

2 

40 

0 

1918 

0 20 

0 1020 

0 

91 

0 

3180 


50 

0 

1100 

2 

42 

0 

1922 

0 27 

0 1004 

0 

92 


3212 


57 

0 

4418 

2 

4 1 

0 

1927 

0 28 

0 1103 

0 

93 

0 

3238 


58 

0 

4429 

2 

10 

0 

193] 

0 29 

0 1141 

0 

94 

0 

320.1 


59 

0 

444 1 

2 

18 

0 

1931 

0 no 

0 1179 

0 

95 

0 

3289 


00 

0 

1452 

2 

.50 

0 

19.18 

0 ni 

0 1217 

0 

90 

0 

3315 


01 

0 

1403 

2 

52 

0 

4911 

0 32 

0 1255 

0 

97 

0 

.3310 


02 

0 

4 171 

2 

54 

0 

1945 

0 nn 

0 1293 

0 

98 

0 

3305 


03 

0 

4 484 

2 

.51. 

0 

4918 

0 34 

0 1331 

0 

99 

0 

3389 


04 

0 

1495 

2 

.58 

0 

4 951 

0 35 

0.1308 


00 

0 

3413 


(.5 

0 

1.505 

2 

liO 

0 

1953 

0 30 

0 1400 


01 

0 

3438 


(.0 

0 

151.5 

2 

02 

0 

4950 

0 37 

0 1443 


02 

0 

3101 


07 

0 

1.52.5 

2 

() 1 

0.4959 

0 38 

0 1480 


03 

0 

3485 


08 

0 

1.5.35 

2 

(lO 

0 

1901 

0 39 

0 1517 


04 

0 

3508 


09 

0 

4.545 

2 

(.8 

0 

190)3 

0 40 

0 1554 


05 

0 

3531 


70 

0 

1.5.51 

2 

70 

0 

1905 

0 11 

0 1591 


00 

0 

,1554 


71 

0 

4.5(.4 

2 

72 

0 

1907 

0 42 

0 1028 


07 

0 

3577 


72 

0 

4 573 

2 

74 

0 

1909 

0.43 

0 1004 


08 


3.599 


7.1 

0 

4582 

2 

70 

0 

1971 

0 44 

0 1700 


09 

0 

3021 


74 

0 

4.591 

2 

78 

0 

4973 

0 45 

0 1730 


10 

0 

3043 


7.5 

0 

4.599 

2 

80 

0 

4974 

0 40 

0 1772 


11 

0 

3005 


70 

0 

4008 

2 

82 

0 

4970 

0 47 

0 1808 


12 

0 

3080 


77 

0 

4010 

2 

81 

0 

4977 

0 48 

0 1814 


13 

0 

3708 


78 

0 

4025 

2 

80 

0 

4979 

0 19 

0 1879 


1 1 

0 

3729 


79 

0 

4(.33 

2 

88 

0 

4980 

0 50 

0 1915 


15 

0 

3719 


HO 

0 

4(>41 

2 

90 

0 

1981 

0 51 

0 1950 


.10 

0 

3770 


81 

0 

4049 

2 

92 

0 

4982 

0 52 

0 1985 


17 

0 

3790 


82 

0 

40.5(i 

2 

91 

0 

4984 

0 53 

0 2019 


18 

0 

3810 


83 

0 

1004 

2 

90 

0 

, 4985 

0 54 

0 2054 


19 

0 

3830 


84 

0, 

,4071 

2 

98 

0 

4980 

0 55 

0 2088 


20 

0 

.3819 


85 

0 

4078 

3 

00 

0 

49805 

0.56 

0 2123 


21 

0 

3869 


80 

0 

4080 

3 

.20 

0 

49931 

0.57 

0 2157 


22 

0 

3888 


87 

0 

4093 

3 

40 

0 

49906 

0 58 

0 2190 


23 

0 

3907 


88 

0 

4099 

3 

60 

0 

499841 

0.59 

0 2224 


24 

0 

3925 


89 

0 

1706 

.3 

80 

0 

499928 

0,00 

0 2257 


25 

0 

.3944 


90 

0 

4713 

4 

00 

0 

499968 

0 01 

0 2291 


20 

0 

3902 


91 

0 

4719 


50 

0 

499997 

0 02 

0 2324 


27 

0 

3980 

1 

92 

0 

4720 

5 

00 

0 

499997 

0 63 

0 2357 


.28 

0 

3997 

1 

9.3 

0 

47.32 





0 64 

0 2389 


29 

0 

4015 

1 

94 

0 

4738 
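Entries of Table T-1 can be regenerated from the error function, since the area under the standard normal density between 0 and z equals one half of erf(z/√2). The sketch below uses only the Python standard library; the function name `phi` is an illustrative choice.

```python
import math


def phi(z: float) -> float:
    """Area under the standard normal density between 0 and z."""
    return 0.5 * math.erf(z / math.sqrt(2.0))


if __name__ == "__main__":
    # Print a few rows in the four-decimal style of the table.
    for z in (0.00, 0.50, 1.00, 1.96, 2.00, 3.00):
        print(f"{z:4.2f}  {phi(z):.4f}")
```

Doubling the result gives the probability that a standard normal variable lies within z of its mean.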









Table T-2. Normal Distributions

  z      P{0 < X < z}   P{|X| < z}   P{|X| > z}   P{X < z}   P{X > z}

1.000      0.34134       0.68268      0.31732     0.84134    0.15866
1.960      0.47500       0.95000      0.05000     0.97500    0.02500
2.000      0.47725       0.95450      0.04550     0.97725    0.02275
2.576      0.49500       0.99000      0.01000     0.99500    0.00500
3.000      0.49865       0.99730      0.00270     0.99865    0.00135
3.291      0.49950       0.99900      0.00100     0.99950    0.00050





Table T-3. A Summary of Some Common Probability Functions 


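Probability function*                                          Mean E(X)    Variance \sigma^2 = E(X^2) - [E(X)]^2    Characteristic function E(e^{itX})

\binom{n}{k} p^k q^{n-k}                                       np           npq                                      (pe^{it} + q)^n
    k = 0, 1, 2, ..., n

e^{-\lambda} \lambda^k / k!                                    \lambda      \lambda                                  e^{\lambda(e^{it} - 1)}
    k = 0, 1, 2, ...

\frac{1}{\sigma\sqrt{2\pi}} e^{-(x-m)^2/2\sigma^2}             m            \sigma^2                                 e^{itm - t^2\sigma^2/2}
    \sigma > 0,  -\infty < x < \infty

f(x) = \frac{1}{\pi} \frac{1}{1 + x^2}                         0            See Sec. 6-8                             e^{-|t|}
    -\infty < x < \infty

f(x) = \frac{1}{b - a}                                         (a + b)/2    (b - a)^2/12                             \frac{e^{itb} - e^{ita}}{it(b - a)}
    a < x < b

f(x) = \frac{1}{2a}                                            0            a^2/3                                    \frac{\sin at}{at}
    -a < x < a

f(x) = 1 - |x|                                                 0            1/6                                      \frac{2(1 - \cos t)}{t^2}
    |x| < 1

f(x) = \lambda e^{-\lambda x},  \lambda > 0                    1/\lambda    1/\lambda^2                              \frac{1}{1 - it/\lambda}
    0 < x < \infty

f(x) = \frac{a}{2} e^{-a|x|}                                   0            2/a^2                                    \frac{a^2}{a^2 + t^2}
    -\infty < x < \infty,  a > 0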


* Assumed to be zero outside the domain of definition. 





Table T-4. Probability of No Error for Best Group Code*

$$Q_1 = \sum_i \alpha_i p^i q^{n-i}$$

           i  \binom{n}{i}     k = 2     k = 3     k = 4     k = 5     k = 6     k = 7     k = 8     k = 9    k = 10
                            \alpha_i  \alpha_i  \alpha_i  \alpha_i  \alpha_i  \alpha_i  \alpha_i  \alpha_i  \alpha_i

n = 4      0             1         1
           1             4         3

n = 5      0             1         1         1
           1             5         5         3
           2            10         2

n = 6      0             1         1         1         1
           1             6         6         6         3
           2            15         9         1

n = 7      0             1         1         1         1         1
           1             7         7         7         7         3
           2            21        18         8
           3            35         6

n = 8      0             1         1         1         1         1         1
           1             8         8         8         8         7         3
           2            28        28        20         7
           3            56        27         3

n = 9      0             1         1         1         1         1         1         1
           1             9         9         9         9         9         7         3
           2            36        36        33        22         6
           3            84        64        21
           4           126        18

n = 10     0             1         1         1         1         1         1         1         1
           1            10        10        10        10        10        10         7         3
           2            45        45        45        39        21         5
           3           120       110        64        14
           4           210        90         8









0 

1 

1 

1 

1 


1 

1 

1 

1 



1 

11 

11 

11 

11 


11 

11 

7 



71 = 11 

2 

55 

55 

55 

55 


20 

4 




3 

165 

165 

126 

61 








4 

330 

226 

63 









5 

462 

54 










0 

1 

1 

1 




1 

1 

1 

1 


1 

12 

12 

12 


1 


12 

12 

7 

3 

71 = 12 

2 

66 

66 

66 




19 

3 



3 

220 

220 

200 









4 

495 

425 

233 









5 

792 

300 










* Reproduced from D. Slepian, Bell System Tech. vol. 35, p. 213, January, 1956, 
with kind permission of Bell System Technical Journal. 
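The coefficients in Table T-4 can be turned into a probability of no error with a few lines of code. The sketch below assumes, following Slepian's paper, that the probability of no error is the sum over i of α_i p^i q^(n-i), where α_i counts the correctable error patterns of weight i, p is the digit error probability, and q = 1 - p. The function name and the sample value of p are illustrative.

```python
from math import comb
from typing import Sequence


def prob_no_error(alpha: Sequence[int], n: int, p: float) -> float:
    """Q1 = sum over i of alpha[i] * p**i * q**(n - i), with q = 1 - p."""
    q = 1.0 - p
    return sum(a * p ** i * q ** (n - i) for i, a in enumerate(alpha))


if __name__ == "__main__":
    n, k = 5, 2
    alpha = [1, 5, 2]            # n = 5, k = 2 column as read from the table
    # Consistency checks: one correctable pattern per parity-check pattern,
    # and never more patterns of weight i than exist.
    assert sum(alpha) == 2 ** (n - k)
    assert all(a <= comb(n, i) for i, a in enumerate(alpha))
    print(prob_no_error(alpha, n, p=0.1))
```

The check that the coefficients in a column add up to 2^(n-k) is a quick way to catch a misread table entry.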




Table T-5. Parity-check Rules for Best Group Code 




Table T-5. Parity-check Rules for Best Group Code (Continued)








Table T-6. Logarithm to the Base 2 


N 

Log N 

N 

Log N

N 

Log N 

N 

Log N 

1 

0.000000 

51 

5 672425 

101 

6.67)8211 

151 

7.238404 

2 

1.000000 

52 

5.700439 

102 

f. C72425 

152 

7 247927 

3 

1 . 584962 

53 

5.727920 

103 

C 686.500 

153 

7 257388 

4 

2.000000 

54 

5 754887 

104 

6 700439 

154 

7 269780 

5 

2 321928 

55 

5.781359 

105 

6 714245 

155 

7.276124 

6 

2,584962 

56 

5.807355 

106 

6 727920 

156 

7.285402 

7 

2 807355 

57 

5 832890 

107 

6.741467 

157 

7 , 294620 

8 

3.000000 

58 

5 857981 

108 

6 7.54887 

158 

7 303780 

9 

3. 169925 

59 

5.882643 

109 

6 768184 

159 

7 312883 

10 

3 321928 

60 

5 906890 

no 

6.781850 

160 

7 321928 

11 

3 459431 

61 

5 930737 

111 

6.794415 

161 

7 330916 

12 

3.584062 

62 

5.954196 

112 

6.807355 

162 

7,339850 

13 

3 700440 

(■)3 

5.977280 

113 

6.820179 

163 

7 348728 

14 

3.807355 

64 

6 000000 

114 

6.832800 

164 

7 357552 

15 

3 . 906890 

65 

6 022367 

115 

6 845490 

165 

7 366322 

IG 

4.000000 

66 

6 044394 

116 

6 857081 

166 

7 375039 

17 

4 087463 

67 

0.060089 

117 

6.870364 

167 

7 383704 

18 

4 169925 

68 

6 087462 

118 

6 882643 

168 

7,392317 

19 

4 247927 

69 

6 108524 

119 

6 894817 

169 

7.400879 

20 

4.321928 

70 

0,12928;j 

120 

6 906890 

170 

7 409391 

21 

4 392317 

71 

6 149747 

121 

6 918863 

171 

7.417852 

22 

4 459431 

72 

6 169925 

122 

6.930737 

172 

7 126204 

23 

4 523562 

73 

6.189824 

123 

6 942514 

173 

7 434628 

24 

4 584962 

74 

6 209453 

124 

6.954196 

174 

7 442943 

25 

4,643856 

75 

6 228818 

125 

6 965784 

175 

7 451211 

26 

4 700439 

76 

6.247927 

126 

6 977280 

176 

7 459431 

27 

4.754887 

77 

6 266786 

127 

6 988684 

177 

7 467005 

28 

4 807355 

78 

6.285402 

128 

7 000000 

178 

7 475733 

29 

4.857981 

79 

6.303780 

129 

7 011227 

179 

7 483815 

30 

4.906890 

80 

6 321928 

130 

7 022367 

180 

7 491853 

31 

4.954196 

81 

6 339850 

131 

7 033423 

181 

7 499846 

32 

5.000000 

82 

6 357552 

132 

7.044394 

182 

7 507794 

33 

5 044394 

83 

6 375039 

133 

7 055282 

183 

7 515699 

34 

5.087463 

84 

6 392317 

134 

7 066089 

184 

7.523562 

35 

5.129283 

85 

6 409391 

135 

7 076815 

185 

7.531381 

36 

5.169925 

86 

6.426264 

136 

7 087462 

186 

7 539158 

37 

5.209453 

87 

6 442943 

137 

7.098032 

187 

7 546894 

38 

5 247927 

88 

6.459431 

138 

7.108524 

188 1 

7 . 554588 

39 

5.285402 

89 

6 475733 

139 

7 118941 

189 

7 502242 

40 

5.321928 

90 

6 491853 

140 

7 129283 

190 

7 569855 

41 

5 357552 

91 

6.507794 

141 

7 139551 

191 

7 . 577428 

42 

5 392317 

92 

6 523562 

142 

7.149747 

192 

7 . 584962 

43 

5.426264 

93 

6 539158 

143 

7 159871 

193 

7 . 592157 

44 

5.459431 

94 

6 554588 

144 

7 169925 

194 

7.599912 

45 

5.491853 

95 

6.569855 

145 

7 179909 

195 

7.607330 

40 

5.523562 

96 

6.584962 

146 

7 189824 

196 

7.614709 

47 

5.554589 

97 

6.599912 

147 

7 199672 

197 

7 622051 

48 

5.584962 

98 

6.614709 

148 

7 209453 

198 

7.629356 

49 

5 614710 

99 

6.629356 

149 

7 219168 

199 

7 . 63(i624 

50 

5.643856 

100 

6.643856 

150 

7.228818 

200 

7.643856 
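Any entry of Table T-6 can be recomputed directly. The sketch below uses the standard-library function `math.log2` and prints rows in the six-decimal style of the table; the helper name `log2_row` is an illustrative choice, and the last printed digit may occasionally differ from the table because of rounding.

```python
import math


def log2_row(n: int) -> str:
    """Format one table row: N followed by log N to the base 2."""
    return f"{n:4d}  {math.log2(n):.6f}"


if __name__ == "__main__":
    for n in (1, 2, 10, 100, 200, 1000):
        print(log2_row(n))
```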




Table T-6. Logarithm to the Base 2 (Continued)


7.651051 
7.658211 
7 665336 
7 672425 
7 . 670480 
7 686500 
7.603487 
7 700430 
7 707350 
7 714245 
7 721000 
7 727020 
7 734700 
7 741467 
7 748102 
7 754887 
7 761551 
7.768184 
7.774787 
7 781350 
7 787002 
7 704415 
7.800800 
7 807354 
7 813781 
7 820170 
7 826548 
7 832800 
7 830203 
7 845400 
7.851740 
7 857081 
7 864186 
7.870364 
7 876516 
7.882643 
7 888743 
7.804817 
7 000866 
7 006800 
7 912880 
7.018863 
7 024812 
7.030737 
7.036638 
7 942514 
7.048367 
7 954106 
7.960002 
7.965784 


7 

071543 

7 

977280 

7 

082003 

7. 

988684 

7 

094353 

8 

000000 

8 

005624 

8 

011227 

8. 

016808 

8. 

022367 

8. 

027006 

8, 

033423 

8 

038018 

8 

044304 

8 

040848 

8 

055282 

8 

060696 

8 

066080 

8 

071462 

8 

076815 

8 

082140 

8 

087463 

8 

092757 

8 

008032 

8 

103287 

8 

108524 

8 

113742 

8 

118011 

8 

124121 

8 

129283 

8 

134426 

8 

130551 

8 

14 1658 

8 

140747 

8 

154818 

8 

150871 

8 

164007 

8 

160025 

8 

174025 

8 

170000 

8 

184875 

8 

180824 

8 

104757 

8 

190672 

8 

204571 

8 

. 200453 

8 

214319 

8 

.219168 

8 

224001 

8 

.228818 


J 233610 
\ 238404 
i 243174 
1.247927 
1 252665 
1 257387 
1 262094 
1.266786 
1.271463 
1.276124 
1 280770 
1 285402 
1 200018 
1 204620 
1 200208 
1 303780 
1.308330 
1 312883 
1 317412 
1 321028 
1 326420 
1 330916 
1 335300 
1 330850 
1 344206 
1 348728 
1 353116 
1.357552 
1 361043 
1 366322 
1 370687 
1 375030 
1 379378 
1 383704 
1.388017 
1.302317 
1 396604 
1 400879 
1.405141 
1 400300 
1 413628 
1 417852 
1 422064 
1 426264 
1 430452 
1.434628 
1.438791 
1.442943 
1.447083 
1.451211 


8.455327 
8.450431 
8 463524 
8 467605 
8 471675 
8 475733 
8.479780 
8.483815 


8 

50.^825 

8 

507704 

8 

511752 

8 

515600 

8 

510(i36 

8 

523561 

8 

527476 

8 

531381 

8 

535275 

8 

530158 

8 

513031 

8 

546804 

8 

550746 

8 

554588 

8 

558120 

8 

51)2242 

8 

560054 

8 

51)0855 

8 

573647 

8. 

577428 

8 

581200 

8 

584962 

8 

588714 

8 

502457 

8 

506180 

8 

509912 

8 

603626 

8 

607330 

8 

611024 

8 

. 614709 

8 

618385 

8 

.622051 

8 

.625708 

8 

629356 

8 

.632905 

8 

636624 

8 

640244 

8 

.643856 














Table T-6. Logarithm to the Base 2 (Continued)



Log N 


Log N 

N 

Log N 

N 

Log N

401 

8.647158 

451 

8.816983 

501 

8.068666 

551 

9.10.1908 

402 

8.651051 

452 

8 820178 

502 

8.971543 

552 

9.108524 

403 

8,654636 

453 

8.823367 

503 

8.074414 

553 

0.111135 

404 

8 658211 

454 

8 826548 

504 

8 077270 

554 

9.113742 

405 

8.661778 

455 

8.829722 

505 

8.080130 

555 

0.116343 

400 

8.665335 

456 

8.832889 

506 

8.082003 

556 

9.118941 

407 

8.668885 

457 

8 836050 

507 

8.085841 

557 

0 121533 

408 

8.672125 

458 

8 839203 

508 

8 ‘>88684 

558 

0.124121 

409 

8.675956 

459 

8.812350 

500 

8.991521 

550 

9 120704 

410 

8.679480 

460 

8 845490 

510 

8 001353 

560 

9.129283 

411 

8 682994 

461 

8 848622 

511 

8 007170 

561 

9 131857 

412 

8 686500 

162 

8,851718 

512 

0.000000 

562 

9.134426 

413 

8 . 680!)lt7 

463 

8.851868 

513 

0 002815 

563 

9.136991 

414 

8 693486 

461 

8.857980 

514 

0 . 005624 

564 

9.139551 

415 

8.696067 

165 

8.861086 

515 

0 . 008428 

565 

0 142107 

41 G 

8 700439 

466 

8.864186 

516 

9.011227 

566 

9.144658 

417 

8 703903 

467 

8.867278 

517 

0.014020 

567 

0.147205 

418 

8 707;{50 

468 

8.870364 

518 

0.016808 

568 

9.149747 

119 

8 710806 

469 

8 873444 

510 

9.019590 

560 

9 152285 

420 

8.711215 

470 

8 876516 

520 

0 022367 

570 

9.154818 

421 

8 717676 

471 

8.879583 

521 

0.025130 

571 

9.157346 

422 

8.721099 

472 

8 882613 

522 

0 027‘>06 

572 

9.1.59871 

423 

8.724513 

473 

8 885696 

523 

9 030667 

573 

0 162301 

424 

8.727920 

474 

8.888743 

524 

0.033423 

574 

9 164907 

425 

8.731318 

475 

8.891783 

525 

0.036173 

575 

9.167418 

42G 

8.734709 

476 

8.894817 

526 

0.038018 

576 

9.169925 

427 

8 738092 

477 

8.897845 

527 

0.011650 

577 

9.172427 

428 

8 741466 

478 

8.900866 

528 

0.044304 

578 

9.174925 

429 

8 714833 

479 

8.903881 

520 

0.047123 

570 

9.177419 

430 

8.748192 

480 

8 906890 

530 

0.010818 

580 

9.179909 

431 

8 751544 

481 

8.909893 

531 

0.052568 

581 

9.182394 

432 

8.754887 

482 

8 912889 

532 

0 055282 

582 

9.184875 

433 

8.758223 

483 

8.915879 

533 

0.057001 

583 

9 187352 

434 

8.761551 

484 

8.918863 

534 

0.060606 

584 

9.189824 

435 

8.764871 

485 

8 921840 

535 

0.063305 

585 

9 192292 

436 

8.768184 

486 

8.924812 

536 

0.066080 

586 

9. 194757 

437 

8.771489 

487 

8.927777 

537 

0.068778 

587 

9 197216 

438 

8 774786 

488 

8.930737 

538 

0.071462 

588 

9.199072 

439 

8.778077 

489 

8 933690 

530 

0 074141 

580 

9 202123 

440 

8.781359 

490 

8 936637 

540 

0.076815 

500 

9.201571 

441 

8 784634 

491 

8 939579 

541 

0.070484 

501 

9.207014 

442 

8 787902 

492 

8.942514 

542 

9,082140 

502 

9.209453 

443 

8.791162 

493 

8 945443 

543 

9 084808 

503 

9.211888 

444 

8.794415 

494 

8 948367 

544 

0.087462 

504 

9.214319 

445 

8 797661 

495 

8 951284 

545 

0 000112 

505 

9.216745 

446 

8.800899 

496 

8 954196 

546 

0.092757 

506 

9.219168 

447 

8.804130 

497 

8.957102 

547 

0 095307 

597 

9 221587 

448 

8 807354 

498 

8 960001 

548 

9.008032 

508 

9.224001 

449 

8.810571 

499 

8 962896 

549 

0.100662 

599 

9.226412 

450 

8.813781 

500 

8.965784 

550 

9.103287 

600 

9.228818 




Table T-6. Logarithm to the Base 2 (Continued)


N 

Log N 

N 

Log N 

601 

9.231221 

651 

9.346513 

602 

9.233619 

652 

9.348728 

oo:^ 

9.236014 

653 

9.350939 

604 

9.238404 

654 

9.353146 

605 

9.240791 

655 

9.355351 

606 

9.243174 

656 

9.357552 

607 

9.245552 

657 

9.359749 

608 

9.247927 

658 

9.361943 

609 

9.250298 

659 

9.364134 

610 

9.252665 

660 

9.366322 

611 

9.255028 

661 

9.368506 

612 

9 257387 

662 

9.370687 

613 

9.259743 

663 

9.372865 

614 

9.262094 

664 

9.375039 

615 

9.264442 

665 

9.377210 

()16 

9.266786 

666 

9.379378 

617 

9.269126 

667 

9.381542 

618 

9.271463 

668 

9.383704 

619 

9.273795 

669 

9.385862 

620 

9.276124 

670 

9.388017 

621 

9 278449 

671 

9.390169 

622 

9.280770 

672 

9.392317 

623 

9.283088 

673 

9.394462 

624 

9.285402 

674 

9,396604 

625 

9.287712 

675 

9.398743 

626 

9.290018 

676 

9.400879 

627 

9.292321 

677 

9.403012 

628 

9 294620 

678 

9.405141 

629 

9.296916 

679 

9.407267 

630 

9.299208 

680 

9.409390 

631 

9 . 301496 

681 

9.411511 

632 

9.303780 

682 

9.413628 

633 

9 . 306061 

683 

9.415741 

634 

9.308339 

684 

9.417852 

635 

9.310612 

685 

9.419960 

636 

9.312883 

686 

9 422064 

637 

9.315149 

687 

9.424166 

638 

9.317412 

688 

9.426264 

639 

9.319672 

689 

9 . 428360 

640 

9.321928 

690 

9.430452 

641 

9.324180 

691 

9 432541 

642 

9.326429 

692 

9.434628 

643 

9.328674 

693 

9.436711 

644 

9.330916 

694 

9.438791 

645 

9.333155 

695 

9.440869 

646 

9.335390 

696 

9.442943 

647 

9.337621 

697 

9.445014 

648 

9.339860 

698 

9.447083 

649 

9 342074 

699 

9.449148 

650 

9.344296 

700 

9 451211 


N 

Log N 

N 

Log N 

701 

9.453270 

751 

9.552669 

702 

9.455327 

752 

9 . 554588 

703 

9.457380 

753 

9.556506 

704 

9.459431 

754 

9.558420 

705 

9.461479 

755 

9.560332 

706 

9.463524 

756 

9.562242 

707 

9.465566 

757 

9.564149 

708 

9.467605 

758 

9 566053 

709 

9.469641 

759 

9 . 5d^7956 

710 

9.471675 

760 

9.569855 

711 

9.473705 

761 

9.571752 

712 

0.475733 

762 

9.57;j647 

713 

9.477758 

763 

9.575639 

714 

9.479780 

764 

9.577428 

715 

9.481799 

765 

9.579315 

716 

9.483815 

766 

9.581200 

717 

9.485829 

767 

9.583082 

718 

9.487840 

768 

9.584962 

719 

9.489848 

769 

9.586839 

720 

9.491853 

770 

9.588714 

721 

9.493855 

771 

9.590587 

722 

9.495855 

772 

9.592457 

723 

9.497851 

773 

9 . 594324 

724 

9.499846 

774 

9.59618!) 

725 

9.501837 

775 

9.598052 

726 

9.503825 

776 

9.599912 

727 

9.505811 

777 

9.601770 

728 

9.507794 

778 

9.603626 

729 

9.509774 

779 

9.605479 

730 

9.511752 

780 

9.607330 

731 

9.513727 

781 

9 609178 

732 

9.515699 

782 

9 611024 

733 

9 517669 

783 

9.612868 

734 

9 519636 

784 

9 614709 

735 

9 521600 

785 

9.616548 

736 

9 523561 

786 

9 618385 

737 

9 525520 

787 

9.620219 

738 

9.527476 

788 

9.622051 

739 

9.529430 

789 

9 623881 

740 

9.531381 

790 

9.625708 

741 

9.533329 

791 

9.627533 

742 

9.535275 

792 

9 629356 

743 

9.537218 

793 

9.631177 

744 

9.539158 

794 

9.632995 

745 

9.541096 

795 

9.634811 

746 

9 543031 

796 

9.636624 

747 

9.544964 

797 

9.638435 

748 

9 . 546894 

798 

9.640244 

749 

9.548821 

799 

9.642051 

750 

9.550746 

800 

9.643856 





Table T-6. Logarithm to the Base 2 (Continued)


N 

Log N 

N 

Log N 

N 

Log N 

N 

Log N 

801 

9.645658 

851 

9 733015 

901 

9.815383 

951 

9.893301 

802 

9.647458 

852 

9 . 734709 

902 

9.816983 

952 

9.894817 

803 

9 . 649256 

853 

9 . 736401 

903 

9.818582 

953 

9.896332 

804 

9 651051 

854 

9 738092 

904 

9.820178 

954 

9.897845 

805 

9.562844 

855 

9.739780 

905 

9.821773 

955 

9.899356 

800 

9.654636 

856 

9 741466 

906 

9.823367 

956 

9.900866 

807 

9.656424 

857 

9 743151 

907 

9.824958 

957 

9.902375 

808 

9.658211 

858 

9 711833 

908 

9 826518 

958 

9.903881 

809 

9.659995 

859 

9.746514 

909 

9.828136 

959 

9.905386 

810 

9.661778 

860 

9 748192 

910 

9.829722 

960 

9.906890 

811 

9.663557 

861 

9 749869 

911 

9.831307 

961 

9.908392 

812 

9.665335 

8()2 

9.751544 

912 

9 832889 

962 

9 . 909893 

813 

9.667111 

863 

9.753216 

913 

9.834471 

963 

9.911391 

814 

9 . 668884 

8(i4 

9 754887 

914 

9.836050 

964 

9.912889 

815 

9 . 670656 

865 

9 756556 

915 

9.837627 

965 

9.914385 

816 

9 . 672425 

866 

9.758223 

916 

9.839203 

966 

9.915879 

817 

9.674192 

867 

9 759888 

917 

9.840777 

907 

9.917372 

818 

9.675956 

8()8 

9.761551 

918 

9.842350 

968 

9.918863 

819 

9.677719 

869 

9 763212 

919 

9 843920 

969 

9.920352 

820 

9.679479 

870 

9.764871 

920 

9.845490 

970 

9 921840 

821 

9.681238 

871 

9.766528 

921 

9 847057 

971 

9.923327 

822 

9 . 682994 

872 

9.768184 

922 

9 848()22 

972 

9.924812 

823 

9.684748 

873 

9.769837 

923 

9.850186 

973 

9 926295 

824 

9.686500 

874 

9 771489 

924 

9.851748 

974 

9.927777 

825 

9 . 688250 

875 

9.773139 

925 

9.853309 

975 

9.929258 

826 

9 699997 

876 

9.774786 

926 

9.854868 

976 

9.930737 

827 

9.691743 

877 

9.776133 

927 

9 . 856425 

977 

9.932214 

828 

9 6934 8(i 

878 

9 778077 

928 

9.857980 

978 

9.933690 

829 

9.695228 

879 

9 779719 

929 

9 859534 

979 

9.935161 

830 

9.696967 

880 

9 781359 

930 

9 861086 

980 

9 . 936637 

831 

9.698704 

881 

9 782998 

931 

9.862637 

981 

9.938109 

832 

9.700439 

882 

9.784634 

932 

9 864186 

982 

9 . 939579 

833 

9 702172 

883 

9 786269 

933 

9 865733 

983 

9.941047 

834 

9.703903 

884 

9.787902 

934 

9 867278 

984 

9.942514 

835 

9 705632 

885 

9.789533 

935 

9.868822 

985 

9.943979 

836 

9.707359 

886 

9.791162 

936 

9 870364 

986 

9.945443 

837 

9.709083 

887 

9.792790 

937 

9.871905 

987 

9.946906 

838 

9.710806 

888 

9.794415 

938 

9 873443 

988 

9.948367 

839 

9.712526 

889 

9.796039 

939 

9.874981 

989 

9 949826 

840 

9.714245 

890 

9 . 797661 

940 

9.876516 

990 

9 951284 

841 

9.715961 

891 

9.799281 

941 

9.878050 

991 

9.952741 

842 

9.717676 

892 

9.800899 

942 

9.879583 

992 

9.954196 

843 

9 719388 

893 

9.802516 

943 

9 881113 

993 

9 955649 

844 

9.721099 

894 

9.804130 

944 

9.882643 

994 

9 957102 

845 

9.722807 

895 

9.805743 

945 

9.884170 

995 

9.958552 

846 

9.724513 

896 

9 807354 

946 

9.885696 

996 

9.960001 

847 

9.726218 

897 

9 808964 

947 

9.887220 

997 

9.961449 

848 

9.727920 

898 

9.810571 

948 

9.888743 

998 

9.962896 

849 

9.729620 

899 

9.812177 

949 

9.890264 

999 

9.964340 

850 

9.731318 

900 

9.813781 

950 

9.891783 

1000 

9.965784 




Table T-7. Entropy of a Discrete Binary Source 


P          -Log P     -P Log P   H          -Q Log Q   -Log Q

0.0001     13.287712  0.001329   0.001473   0.000144   0.000144
0.0005     10.965784  0.005483   0.006204   0.000721   0.000722
0.0010     9.965784   0.009966   0.011408   0.001442   0.001443
0.0015     9.380822   0.014071   0.016234   0.002162   0.002166
0.0020     8.965784   0.017932   0.020814   0.002882   0.002888
0.0025     8.643856   0.021610   0.025212   0.003602   0.003611
0.0030     8.380822   0.025142   0.029464   0.004322   0.004335
0.0035     8.158429   0.028555   0.033595   0.005041   0.005058
0.0040     7.965784   0.031863   0.037622   0.005759   0.005782
0.0045     7.795859   0.035081   0.041559   0.006477   0.006507
0.0050     7.643856   0.038219   0.045415   0.007195   0.007232
0.0055     7.506353   0.041285   0.049198   0.007913   0.007957
0.0060     7.380822   0.044285   0.052915   0.008630   0.008682
0.0065     7.265345   0.047225   0.056572   0.009347   0.009408
0.0070     7.158429   0.050109   0.060172   0.010063   0.010134
0.0075     7.058894   0.052942   0.063721   0.010780   0.010861
0.0080     6.965784   0.055726   0.067222   0.011495   0.011588
0.0085     6.878321   0.058466   0.070676   0.012211   0.012315
0.0090     6.795859   0.061163   0.074088   0.012926   0.013043
0.0095     6.717857   0.063820   0.077460   0.013640   0.013771
0.0100     6.643856   0.066439   0.080793   0.014355   0.014500
0.0110     6.506353   0.071570   0.087352   0.015782   0.015958
0.0120     6.380822   0.076570   0.093778   0.017208   0.017417
0.0130     6.265345   0.081449   0.100082   0.018633   0.018878
0.0140     6.158429   0.086218   0.106274   0.020056   0.020340
0.0150     6.058894   0.090883   0.112361   0.021477   0.021804
0.0160     5.965784   0.095453   0.118350   0.022897   0.023270
0.0170     5.878321   0.099931   0.124248   0.024316   0.024737
0.0180     5.795859   0.104325   0.130059   0.025733   0.026205
0.0190     5.717857   0.108639   0.135788   0.027149   0.027675
0.0200     5.643856   0.112877   0.141441   0.028563   0.029146
0.0210     5.573467   0.117043   0.147019   0.029976   0.030619
0.0220     5.506353   0.121140   0.152527   0.031388   0.032094
0.0230     5.442222   0.125171   0.157969   0.032797   0.033570
0.0240     5.380822   0.129140   0.163346   0.034206   0.035047
0.0250     5.321928   0.133048   0.168661   0.035613   0.036526
0.0260     5.265345   0.136899   0.173917   0.037018   0.038006
0.0270     5.210897   0.140694   0.179116   0.038422   0.039488
0.0280     5.158429   0.144436   0.184261   0.039825   0.040972
0.0290     5.107803   0.148126   0.189352   0.041226   0.042457



Table T-7. Entropy of a Discrete Binary Source (Continued) 


P          -Log P     -P Log P   H          -Q Log Q   -Log Q

0.0300     5.058894   0.151767   0.194392   0.042625   0.043943
0.0310     5.011588   0.155359   0.199382   0.044023   0.045431
0.0320     4.965784   0.158905   0.204325   0.045420   0.046921
0.0330     4.921390   0.162406   0.209220   0.046815   0.048412
0.0340     4.878321   0.165863   0.214071   0.048208   0.049905
0.0350     4.836501   0.169278   0.218878   0.049600   0.051399
0.0360     4.795859   0.172651   0.223642   0.050991   0.052895
0.0370     4.756331   0.175984   0.228364   0.052380   0.054392
0.0380     4.717857   0.179279   0.233046   0.053767   0.055891
0.0390     4.680382   0.182535   0.237688   0.055153   0.057392
0.0400     4.643856   0.185754   0.242292   0.056538   0.058894
0.0410     4.608232   0.188938   0.246858   0.057921   0.060397
0.0420     4.573467   0.192086   0.251388   0.059303   0.061902
0.0430     4.539519   0.195199   0.255882   0.060683   0.063409
0.0440     4.506353   0.198280   0.260341   0.062061   0.064917
0.0450     4.473931   0.201327   0.264765   0.063438   0.066427
0.0460     4.442222   0.204342   0.269156   0.064814   0.067939
0.0470     4.411195   0.207326   0.273514   0.066188   0.069452
0.0480     4.380822   0.210279   0.277840   0.067560   0.070967
0.0490     4.351074   0.213203   0.282134   0.068931   0.072483
0.0500     4.321928   0.216096   0.286397   0.070301   0.074001
0.0510     4.293359   0.218961   0.290630   0.071668   0.075520
0.0520     4.265345   0.221798   0.294833   0.073035   0.077041
0.0530     4.237864   0.224607   0.299007   0.074400   0.078564
0.0540     4.210897   0.227388   0.303152   0.075763   0.080088
0.0550     4.184425   0.230143   0.307268   0.077125   0.081614
0.0560     4.158429   0.232872   0.311357   0.078485   0.083141
0.0570     4.132894   0.235575   0.315419   0.079844   0.084670
0.0580     4.107803   0.238253   0.319454   0.081201   0.086201
0.0590     4.083141   0.240905   0.323462   0.082557   0.087733
0.0600     4.058894   0.243534   0.327445   0.083911   0.089267
0.0625     4.000000   0.250000   0.337290   0.087290   0.093109
0.0650     3.943416   0.256322   0.346981   0.090659   0.096962
0.0675     3.888969   0.262505   0.356524   0.094019   0.100824
0.0700     3.836501   0.268555   0.365924   0.097369   0.104697
0.0725     3.785875   0.274476   0.375185   0.100709   0.108581
0.0750     3.736966   0.280272   0.384312   0.104039   0.112475
0.0775     3.689660   0.285949   0.393308   0.107360   0.116379
0.0800     3.643856   0.291508   0.402179   0.110671   0.120294
0.0825     3.599462   0.296956   0.410927   0.113972   0.124220




Table T-7. Entropy of a Discrete Binary Source (Continued) 


P          -Log P     -P Log P   H          -Q Log Q   -Log Q

0.0850     3.556393   0.302293   0.419556   0.117263   0.128156
0.0875     3.514573   0.307525   0.428070   0.120544   0.132104
0.0900     3.473931   0.312654   0.436470   0.123816   0.136062
0.0925     3.434403   0.317682   0.444760   0.127078   0.140030
0.0950     3.395929   0.322613   0.452943   0.130329   0.144010
0.0975     3.358454   0.327449   0.461020   0.133571   0.148001
0.1000     3.321928   0.332193   0.468996   0.136803   0.152003
0.1025     3.286304   0.336846   0.476871   0.140024   0.156016
0.1050     3.251539   0.341412   0.484648   0.143236   0.160040
0.1075     3.217591   0.345891   0.492329   0.146438   0.164076
0.1100     3.184425   0.350287   0.499916   0.149629   0.168123
0.1125     3.152003   0.354600   0.507411   0.152811   0.172181
0.1150     3.120294   0.358834   0.514816   0.155982   0.176251
0.1175     3.089267   0.362989   0.522132   0.159143   0.180332
0.1200     3.058894   0.367067   0.529361   0.162294   0.184425
0.1225     3.029146   0.371070   0.536505   0.165434   0.188529
0.1250     3.000000   0.375000   0.543564   0.168564   0.192645
0.1275     2.971431   0.378857   0.550542   0.171684   0.196773
0.1300     2.943416   0.382644   0.557438   0.174794   0.200913
0.1325     2.915936   0.386361   0.564255   0.177893   0.205064
0.1350     2.888969   0.390011   0.570993   0.180982   0.209228
0.1375     2.862496   0.393593   0.577654   0.184061   0.213404
0.1400     2.836501   0.397110   0.584239   0.187129   0.217591
0.1425     2.810966   0.400563   0.590749   0.190186   0.221791
0.1450     2.785875   0.403952   0.597185   0.193233   0.226004
0.1475     2.761213   0.407279   0.603549   0.196270   0.230228
0.1500     2.736966   0.410545   0.609840   0.199295   0.234465
0.1525     2.713119   0.413751   0.616061   0.202311   0.238715
0.1550     2.689660   0.416897   0.622213   0.205315   0.242977
0.1575     2.666576   0.419986   0.628295   0.208309   0.247251
0.1600     2.643856   0.423017   0.634310   0.211293   0.251539
0.1625     2.621488   0.425992   0.640257   0.214265   0.255839
0.1650     2.599462   0.428911   0.646138   0.217227   0.260152
0.1675     2.577767   0.431776   0.651954   0.220178   0.264478
0.1700     2.556393   0.434587   0.657705   0.223118   0.268817
0.1725     2.535332   0.437345   0.663392   0.226047   0.273169
0.1750     2.514573   0.440050   0.669016   0.228966   0.277534
0.1775     2.494109   0.442704   0.674577   0.231873   0.281912
0.1800     2.473931   0.445308   0.680077   0.234769   0.286304
0.1825     2.454032   0.447861   0.685516   0.237655   0.290709





Table T-7. Entropy of a Discrete Binary Source (Continued) 


P          -Log P     -P Log P   H          -Q Log Q   -Log Q

0.1850     2.434403   0.450365   0.690894   0.240529   0.295128
0.1875     2.415037   0.452820   0.696212   0.243393   0.299560
0.1900     2.395929   0.455226   0.701471   0.246245   0.304006
0.1925     2.377070   0.457586   0.706672   0.249086   0.308466
0.1950     2.358454   0.459899   0.711815   0.251916   0.312939
0.1975     2.340075   0.462165   0.716900   0.254735   0.317427
0.2000     2.321928   0.464386   0.721928   0.257542   0.321928
0.2050     2.286304   0.468692   0.731816   0.263124   0.330973
0.2100     2.251539   0.472823   0.741483   0.268660   0.340075
0.2150     2.217591   0.476782   0.750932   0.274150   0.349235
0.2200     2.184425   0.480573   0.760167   0.279594   0.358454
0.2250     2.152003   0.484201   0.769193   0.284992   0.367732
0.2300     2.120294   0.487668   0.778011   0.290344   0.377070
0.2350     2.089267   0.490978   0.786626   0.295648   0.386468
0.2400     2.058894   0.494134   0.795040   0.300906   0.395929
0.2450     2.029146   0.497141   0.803257   0.306116   0.405451
0.2500     2.000000   0.500000   0.811278   0.311278   0.415037
0.2550     1.971431   0.502715   0.819107   0.316392   0.424688
0.2600     1.943416   0.505288   0.826746   0.321458   0.434403
0.2650     1.915936   0.507723   0.834198   0.326475   0.444184
0.2700     1.888969   0.510022   0.841465   0.331443   0.454032
0.2750     1.862496   0.512187   0.848548   0.336362   0.463947
0.2800     1.836501   0.514220   0.855451   0.341230   0.473931
0.2850     1.810966   0.516125   0.862175   0.346049   0.483985
0.2900     1.785875   0.517904   0.868721   0.350817   0.494109
0.2950     1.761213   0.519558   0.875093   0.355535   0.504305
0.3000     1.736966   0.521090   0.881291   0.360201   0.514573
0.3050     1.713119   0.522501   0.887317   0.364816   0.524915
0.3100     1.689660   0.523795   0.893173   0.369379   0.535332
0.3150     1.666576   0.524972   0.898861   0.373890   0.545824
0.3200     1.643856   0.526034   0.904381   0.378347   0.556393
0.3250     1.621488   0.526984   0.909736   0.382752   0.567041
0.3300     1.599462   0.527822   0.914926   0.387104   0.577767
0.3350     1.577767   0.528552   0.919953   0.391402   0.588574
0.3400     1.556393   0.529174   0.924819   0.395645   0.599462
0.3450     1.535332   0.529689   0.929523   0.399834   0.610433
0.3500     1.514573   0.530101   0.934068   0.403967   0.621488
0.3550     1.494109   0.530409   0.938454   0.408046   0.632629
0.3600     1.473931   0.530615   0.942683   0.412068   0.643856
0.3650     1.454032   0.530722   0.946755   0.416034   0.655171




Table T-7. Entropy of a Discrete Binary Source (Continued) 


P          -Log P     -P Log P   H          -Q Log Q   -Log Q

0.3700     1.434403   0.530729   0.950672   0.419943   0.666576
0.3750     1.415037   0.530639   0.954434   0.423795   0.678072
0.3800     1.395929   0.530453   0.958042   0.427589   0.689660
0.3850     1.377070   0.530172   0.961497   0.431325   0.701342
0.3900     1.358454   0.529797   0.964800   0.435002   0.713119
0.3950     1.340075   0.529330   0.967951   0.438621   0.724993
0.4000     1.321928   0.528771   0.970951   0.442179   0.736966
0.4050     1.304006   0.528122   0.973800   0.445678   0.749038
0.4100     1.286304   0.527385   0.976500   0.449116   0.761213
0.4150     1.268817   0.526559   0.979051   0.452493   0.773491
0.4200     1.251539   0.525646   0.981454   0.455808   0.785875
0.4250     1.234465   0.524648   0.983708   0.459061   0.798366
0.4300     1.217591   0.523564   0.985815   0.462251   0.810966
0.4350     1.200913   0.522397   0.987775   0.465378   0.823677
0.4400     1.184425   0.521147   0.989588   0.468441   0.836501
0.4450     1.168123   0.519815   0.991254   0.471439   0.849440
0.4500     1.152003   0.518401   0.992774   0.474373   0.862496
0.4550     1.136062   0.516908   0.994149   0.477241   0.875672
0.4600     1.120294   0.515335   0.995378   0.480043   0.888969
0.4650     1.104697   0.513684   0.996462   0.482778   0.902389
0.4700     1.089267   0.511956   0.997402   0.485446   0.915936
0.4750     1.074001   0.510150   0.998196   0.488046   0.929611
0.4800     1.058894   0.508269   0.998846   0.490577   0.943416
0.4850     1.043943   0.506313   0.999351   0.493038   0.957356
0.4900     1.029146   0.504282   0.999711   0.495430   0.971431
0.4950     1.014500   0.502177   0.999928   0.497751   0.985645
0.5000     1.000000   0.500000   1.000000   0.500000   1.000000
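
Any row of Table T-7 can be recomputed directly from its value of P. The sketch below is an illustration only: it assumes that "Log" in the column headings is the base-2 logarithm and that Q = 1 - P, both of which agree with the tabulated entries (for example, -Log P = 1 at P = 0.5). The function name binary_entropy_row is hypothetical and does not come from the text.

```python
import math


def binary_entropy_row(p):
    """Return (P, -log2 P, -P log2 P, H, -Q log2 Q, -log2 Q) with Q = 1 - P.

    Assumes base-2 logarithms, so H is expressed in bits.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    q = 1.0 - p
    neg_log_p = -math.log2(p)
    neg_log_q = -math.log2(q)
    p_term = p * neg_log_p          # contribution of the symbol with probability P
    q_term = q * neg_log_q          # contribution of the symbol with probability Q
    return (p, neg_log_p, p_term, p_term + q_term, q_term, neg_log_q)


if __name__ == "__main__":
    # Print a few rows in the same column order as Table T-7.
    for p in (0.0001, 0.1250, 0.2500, 0.5000):
        print("  ".join(f"{value:.6f}" for value in binary_entropy_row(p)))
```

Because the table rounds each column separately to six decimals, a recomputed H may differ from the printed sum of the two partial terms by one unit in the last place.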




BIBLIOGRAPHY 


REFERENCE BOOKS ON PROBABILITY THEORY 

Bartlett, M. S.: “An Introduction to Stochastic Processes,” Cambridge University 
Press, New York, 1955. 

Carnap, Rudolf: “Logical Foundations of Probability,” University of Chicago Press, 
Chicago, 1950. 

Cramér, Harald: “Mathematical Methods of Statistics,” Princeton University Press, 
Princeton, N.J., 1946. 

Darmois, G.: “Calcul des probabilités,” Centre de documentation universitaire, 
Sorbonne, Paris, 1954. 

Derman, C., and M. Klein: “Probability and Statistical Inference,” Oxford University 
Press, New York, 1958. 

Doob, John L.: “Stochastic Processes,” John Wiley & Sons, Inc., New York, 1953. 

Dugué, D.: “Traité de statistique théorique et appliquée,” Masson et Cie, Paris, 1958. 

Feller, William: “Probability Theory and Its Applications,” John Wiley & Sons, Inc., 
New York, 1950. 

Fortet, R.: “Calcul des probabilités, I,” Centre national de la recherche scientifique, 
Paris, 1950. 

Fraser, D. A. S.: “Statistics: An Introduction,” John Wiley & Sons, Inc., New York, 
1958. 

Jeffreys, Harold: “Theory of Probability,” 2d ed., Oxford University Press, New York, 
1948. 

Kemeny, J. G., and J. L. Snell: “Finite Markov Chains,” D. Van Nostrand Company, 
Inc., Princeton, N.J., 1960. 

Kolmogorov, Andrei N. (I): “Foundations of the Theory of Probability,” Chelsea 
Publishing Company, New York, 1950. 

Lehmann, E. L.: “Theory of Testing Hypotheses,” John Wiley & Sons, Inc., New York, 
1960. 

Loève, Michel: “Probability Theory,” D. Van Nostrand Company, Inc., Princeton, 
N.J., 1955. 

Mood, Alexander M.: “Introduction to the Theory of Statistics,” McGraw-Hill Book 
Company, Inc., New York, 1950. 

Parzen, E.: “Modern Probability Theory and Its Applications,” John Wiley & Sons, 
Inc., New York, 1960. 

Uspensky, J. V.: “Introduction to Mathematical Probability,” McGraw-Hill Book 
Company, Inc., New York, 1937. 

Wald, A.: “Statistical Decision Functions,” John Wiley & Sons, Inc., New York, 1950. 

Wilks, S. S.: “Mathematical Statistics,” Princeton University Press, Princeton, N.J., 
1943. 





